[00:00:01] T246212: Move wgULSLanguageDetection variable to CommonSettings.php and document it - https://phabricator.wikimedia.org/T246212 [00:00:05] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T0000). [00:00:05] No GERRIT patches in the queue for this window AFAICS. [00:00:24] (Still deploying.) [00:01:19] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T246212 Stop setting wgULSLanguageDetection in IS, set in CS (duration: 01m 05s) [00:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:35] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 01m 03s) [00:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:02] (03PS8) 10Jforrester: Merge $wgLogo and $wgLogoHD into $wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) [00:05:18] (03PS7) 10Jforrester: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 [00:05:59] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson Finished cables handing off to chris for remaining steps name rack_name position switch p... 
[00:06:05] (03CR) 10jerkins-bot: [V: 04-1] Merge $wgLogo and $wgLogoHD into $wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [00:06:15] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10Jclark-ctr) [00:06:18] (03CR) 10jerkins-bot: [V: 04-1] Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [00:06:31] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Volans) >>! In T243112#5922017, @Papaul wrote: > @Volans i ma trying the downtime command from cookbook to downtime a host before running the auto-... [00:06:59] (03PS9) 10Jforrester: Merge $wgLogo and $wgLogoHD into $wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) [00:08:38] (03CR) 10Jforrester: [C: 03+2] Merge $wgLogo and $wgLogoHD into $wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [00:09:38] (03Merged) 10jenkins-bot: Merge $wgLogo and $wgLogoHD into $wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [00:13:15] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T232140: Stop setting wgLogoHD from wgLogos (duration: 01m 05s) [00:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:22] T232140: Separate out logo handling into square image logos and long text/wordmark banner logos - https://phabricator.wikimedia.org/T232140 [00:15:05] (03PS2) 10Jforrester: Complete WikiPage/Article split and deprecate Page interface change 
Article::getTouched to Article::getPage()->getTouched() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572751 (https://phabricator.wikimedia.org/T239975) (owner: 10Art-Baltai) [00:15:12] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T232140: Merge definition of wgLogos and wgLogo (duration: 01m 04s) [00:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:47] (03PS3) 10Jforrester: extract2: Use Article::getPage()->getTouched(), not Article::getTouched [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572751 (https://phabricator.wikimedia.org/T239975) (owner: 10Art-Baltai) [00:17:01] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 01m 04s) [00:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:26] (03PS8) 10Jforrester: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 [00:17:37] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul) @Volans Thanks [00:18:13] (03CR) 10Jforrester: [C: 04-1] "Waiting for post-wmf.21 tomorrow." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [00:18:50] (03CR) 10Jforrester: [C: 03+2] extract2: Use Article::getPage()->getTouched(), not Article::getTouched [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572751 (https://phabricator.wikimedia.org/T239975) (owner: 10Art-Baltai) [00:19:46] (03Merged) 10jenkins-bot: extract2: Use Article::getPage()->getTouched(), not Article::getTouched [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572751 (https://phabricator.wikimedia.org/T239975) (owner: 10Art-Baltai) [00:21:27] !log jforrester@deploy1001 Synchronized w/extract2.php: T239975: Use Article::getPage()->getTouched(), not Article::getTouched (duration: 01m 04s) [00:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:35] T239975: Complete WikiPage/Article split and deprecate Page interface - https://phabricator.wikimedia.org/T239975 [00:24:34] Prod clear. [00:24:57] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2009.codfw.wmnet ` The log can be fou... [00:25:20] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2010.codfw.wmnet ` The log can be fou... [00:27:55] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10Jclark-ctr) @wiki_willy I have checked our storage room we have no spares host is 5 years old at the time drive needed is a 300gb 15k sas. current drive in... 
[00:39:52] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [00:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:10] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [00:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:06] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2009.codfw.wmnet'] ` and were **ALL** successful. [00:49:43] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [00:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:17] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10wiki_willy) @aborrero (and @Jclark-ctr for visibility) - it looks this was purchased back in 2014, and past the 5yr server life cycle. Would it be possible to... 
[00:51:07] (03PS1) 10CDanis: Revert "Depool esams (hardware troubles)" [dns] - 10https://gerrit.wikimedia.org/r/575105 [00:52:03] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [00:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:01] (03CR) 10Ayounsi: [C: 03+1] Revert "Depool esams (hardware troubles)" [dns] - 10https://gerrit.wikimedia.org/r/575105 (owner: 10CDanis) [00:55:39] (03CR) 10CDanis: [C: 03+2] Revert "Depool esams (hardware troubles)" [dns] - 10https://gerrit.wikimedia.org/r/575105 (owner: 10CDanis) [00:55:45] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2011.codfw.wmnet ` The log can be fou... [00:56:31] !log repool esams 🙌 😎 [00:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:52] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2010.codfw.wmnet'] ` and were **ALL** successful. [01:00:04] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T0100). [01:01:41] 10Operations, 10MediaWiki-General, 10observability: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10colewhite) Per @fgiunchedi recommendation, I put together a [[ https://github.com/shdubsh/prometheus_client_php/tree/DirectFileStore | very basic mockup of how DirectFileStore might... 
[01:06:34] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 57.36 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:09:20] ^ expected [01:09:39] codfw was getting most US traffic, and now isn't [01:10:14] (03PS1) 10BryanDavis: webservice-runner: Fix --extra-args handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575106 [01:10:45] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [01:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:13:09] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [01:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:57] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2011.codfw.wmnet'] ` and were **ALL** successful. [01:18:02] (03CR) 10BryanDavis: [C: 04-1] webservice-runner: Fix --extra-args handling (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575106 (owner: 10BryanDavis) [01:18:10] (03PS2) 10BryanDavis: webservice-runner: Fix --extra-args handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575106 [01:19:44] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2012.codfw.wmnet ` The log can be fou... 
[01:22:35] (03PS3) 10BryanDavis: webservice-runner: Fix --extra-args handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575106 [01:26:56] (03CR) 10BryanDavis: [C: 03+2] kubernetes: Remove deprecated flag from tcl image [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/573823 (owner: 10BryanDavis) [01:27:08] (03CR) 10BryanDavis: [C: 03+2] webservice-runner: Fix --extra-args handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575106 (owner: 10BryanDavis) [01:27:34] (03Merged) 10jenkins-bot: kubernetes: Remove deprecated flag from tcl image [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/573823 (owner: 10BryanDavis) [01:27:38] !log re-enable BGP to telia in esams [01:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:44] (03Merged) 10jenkins-bot: webservice-runner: Fix --extra-args handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575106 (owner: 10BryanDavis) [01:28:10] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2013.codfw.wmnet ` The log can be fou... 
[01:30:14] (03PS1) 10Holger Knust: WIP: changeprop/cpjobqueue: Added new config template for cpjobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) [01:30:28] (03CR) 10jerkins-bot: [V: 04-1] WIP: changeprop/cpjobqueue: Added new config template for cpjobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [01:34:44] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [01:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:52] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:37:01] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [01:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:42:50] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2012.codfw.wmnet'] ` and were **ALL** successful. 
[01:43:09] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [01:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:27] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [01:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:25] (03PS2) 10Holger Knust: WIP: changeprop/cpjobqueue: Added new config template for cpjobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) [01:50:18] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2013.codfw.wmnet'] ` and were **ALL** successful. [01:51:11] (03PS1) 10BryanDavis: 3rd try at making extra_args handling "better" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575110 (https://phabricator.wikimedia.org/T244894) [01:51:44] (03CR) 10Holger Knust: "First draft. Will likely need to test some more tomorrow morning. 
These are just the changes to create the different config files based on" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [01:52:28] (03CR) 10BryanDavis: [C: 03+2] 3rd try at making extra_args handling "better" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575110 (https://phabricator.wikimedia.org/T244894) (owner: 10BryanDavis) [01:53:04] (03Merged) 10jenkins-bot: 3rd try at making extra_args handling "better" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575110 (https://phabricator.wikimedia.org/T244894) (owner: 10BryanDavis) [01:54:10] (03Abandoned) 10BryanDavis: Partially revert changes to improve support for extra_args [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/574236 (https://phabricator.wikimedia.org/T244894) (owner: 10Dapete) [02:02:20] 10Operations, 10Patch-For-Review, 10User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (10CDanis) FWIW I think it would make sense to at least stick a Wikimedia logo there sooner rather than later. [02:07:14] (03PS1) 10BryanDavis: d/changelog: prepare 0.64 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575111 [02:07:45] (03CR) 10jerkins-bot: [V: 04-1] d/changelog: prepare 0.64 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575111 (owner: 10BryanDavis) [02:08:12] (03CR) 10Ppchelko: "Hm... Hmm..Hmm...Hm..." [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [02:08:27] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2014.codfw.wmnet ` The log can be fou... 
[02:08:33] (03CR) 10BryanDavis: "recheck" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575111 (owner: 10BryanDavis) [02:08:48] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2015.codfw.wmnet ` The log can be fou... [02:17:16] 10Operations, 10ops-codfw, 10Discovery: elastic2043 has hardware errors that trigger reboots - https://phabricator.wikimedia.org/T243715 (10Papaul) 05Open→03Resolved I Was able to upgrade the IDRAC as well, the Dell tech wasn't very helpful. I clear the log and drained the power on the 13th so what was m... [02:19:26] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) codfw:fundraising single-cpu misc servers frpig2001,civi2001.pay-lvs200[1-2] - https://phabricator.wikimedia.org/T244950 (10Papaul) [02:22:44] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [02:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:25] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [02:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:58] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [02:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:26] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [02:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:43] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2015.codfw.wmnet'] ` and were **ALL** successful. 
[02:32:06] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2017.codfw.wmnet ` The log can be fou... [02:33:14] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2014.codfw.wmnet'] ` and were **ALL** successful. [02:35:49] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2016.codfw.wmnet ` The log can be fou... [02:47:04] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [02:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:37] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [02:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:26] (03PS1) 10CDanis: style: add Wikimedia Foundation logo [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) [02:50:46] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [02:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:17] (03CR) 10CDanis: "Not 100% sure of this, nor how to test, but making an attempt anyway :)" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis) [02:53:05] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [02:53:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:22] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2017.codfw.wmnet'] ` and were **ALL** successful. [02:57:51] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2016.codfw.wmnet'] ` and were **ALL** successful. [03:11:41] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2018.codfw.wmnet ` The log can be fou... [03:12:19] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2019.codfw.wmnet ` The log can be fou... 
[03:26:39] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [03:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:27:18] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [03:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:54] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [03:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:20] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [03:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:33:40] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2018.codfw.wmnet'] ` and were **ALL** successful. [03:35:35] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2020.codfw.wmnet ` The log can be fou... [03:37:08] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2019.codfw.wmnet'] ` and were **ALL** successful. 
[03:50:34] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [03:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:52:48] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [03:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:54:47] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul) [03:56:17] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul) All parse nodes are ready for service just missing parse200[7-8] i think the problem is a wrong mgmt password. I will look into this once a... [03:57:34] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2020.codfw.wmnet'] ` and were **ALL** successful. [04:02:14] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:04:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:41:26] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10Aklapper) @MoritzMuehlenhoff: Could you please answer the last comment? Thanks! 
[05:33:13] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10Dzahn) 05Open→03Stalled [05:40:31] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 113.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [05:54:06] (03CR) 10Gergő Tisza: "Scheduled for SWAT tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574634 (https://phabricator.wikimedia.org/T240559) (owner: 10Gergő Tisza) [06:12:07] (03PS1) 10Marostegui: Revert "db1084: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/575130 [06:12:15] (03CR) 10Marostegui: [C: 04-2] "Needs to catch up" [puppet] - 10https://gerrit.wikimedia.org/r/575130 (owner: 10Marostegui) [06:12:29] 10Operations, 10ops-eqiad, 10DC-Ops: Replace broken BBU on db1084 (HP host) - https://phabricator.wikimedia.org/T245647 (10Marostegui) Thanks John: ` Battery/Capacitor Count: 1 Battery/Capacitor Status: OK ` I have also started MySQL, but it needs catching up. I will take care from here. Thanks again! 
[06:40:43] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.079e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [06:55:49] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 8728 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [07:03:59] (03CR) 10Muehlenhoff: Re-enable CAS authentication after enabling CASValidateSAML (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [07:20:11] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [07:30:18] (03PS1) 10Muehlenhoff: Adapt cross-validate-accounts for system users [puppet] - 10https://gerrit.wikimedia.org/r/575141 (https://phabricator.wikimedia.org/T235161) [07:31:17] I am going to depool and create a dcops ticket for db1098 [07:35:23] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 116.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [07:39:44] 10Operations, 10Release-Engineering-Team, 10serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (10jijiki) >>! 
In T245841#5919699, @Joe wrote: > > What would having all scap proxies also be mcrouter proxies change in terms of the scenario you described above? > This w... [07:45:10] 10Operations, 10ops-eqiad, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) 05Open→03Resolved No differences found on s3, s2 tables between source backups and production. Issue fixed. [07:53:15] (03PS1) 10Muehlenhoff: Unroll Partman configs for Ganeti-based clusters [puppet] - 10https://gerrit.wikimedia.org/r/575202 (https://phabricator.wikimedia.org/T156955) [07:58:29] (03PS1) 10Vgutierrez: lvs: Replace lvs2006 with lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/575203 (https://phabricator.wikimedia.org/T196560) [08:06:19] (03PS2) 10Aaron Schulz: [DNM] Use DBO_DEFAULT for extension1 since it is not for key/value blob storage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525977 [08:08:09] (03Abandoned) 10Aaron Schulz: Move duplicated RDBMS host lists to ProductionServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/524695 (owner: 10Aaron Schulz) [08:14:49] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1098 at 50%', diff saved to https://phabricator.wikimedia.org/P10535 and previous config saved to /var/cache/conftool/dbconfig/20200227-081449-jynus.json [08:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:49] 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10MoritzMuehlenhoff) While the keyserver networks have some structural issues which are pending some changes and a number of keys... 
[08:26:48] !log killed SpecialFewestRevisions::reallyDoQuery long running query on db1101:s8, causing lag [08:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:13] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [08:27:22] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1003/21107/" [puppet] - 10https://gerrit.wikimedia.org/r/575203 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez) [08:40:28] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice! It's missing a few files and since we want to deduplicate this and have a single file for all 3 clusters, I 'd prefer if we don't di" (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [08:47:29] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10MoritzMuehlenhoff) @HMarcus We talked about this in yesterday's Infrastructure Foundations SRE; we would avoid to query the LDAP endpoint of J... 
[08:51:23] (03PS2) 10Gehel: airflow: Drop old airflow user/group statement [puppet] - 10https://gerrit.wikimedia.org/r/574538 (owner: 10EBernhardson) [08:54:23] (03CR) 10Gehel: [C: 03+2] airflow: Drop old airflow user/group statement [puppet] - 10https://gerrit.wikimedia.org/r/574538 (owner: 10EBernhardson) [09:01:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574888 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [09:03:45] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1098 (s6 & s7)', diff saved to https://phabricator.wikimedia.org/P10536 and previous config saved to /var/cache/conftool/dbconfig/20200227-090344-jynus.json [09:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me! I'll push an updated change to the "de" locale when this is merged (it also needs to be switched from formal to informal" (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [09:05:51] (03CR) 10Alexandros Kosiaris: configmaster: Add DNS Discovery discrepancy check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [09:07:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] Unroll Partman configs for Ganeti-based clusters [puppet] - 10https://gerrit.wikimedia.org/r/575202 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:10:36] (03CR) 10Muehlenhoff: [C: 03+2] Unroll Partman configs for Ganeti-based clusters [puppet] - 10https://gerrit.wikimedia.org/r/575202 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:12:10] (03PS8) 10Alexandros Kosiaris: configmaster: Add DNS Discovery discrepancy check [puppet] - 10https://gerrit.wikimedia.org/r/573963 [09:16:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] "I 'll merge this. 
It has gotten 3 +1s on principle up to now and I have addressed various implementation comments. Hopefully it will prove" [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [09:17:04] 10Operations, 10ops-eqiad: db1098 power redundancy lost - https://phabricator.wikimedia.org/T246323 (10jcrespo) [09:19:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] style: remove branding [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [09:19:34] (03CR) 10Jbond: [V: 03+2 C: 03+2] templates: add initial templates to provide git history [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574888 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [09:21:12] 10Operations, 10ops-eqiad: db1098 power redundancy lost - https://phabricator.wikimedia.org/T246323 (10jcrespo) @wiki_willy This could be a power supply failure or other power connectivity issue, there is only so much we can check remotely. We need an onsite check. The server is depooled from production out of... [09:22:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:23:12] ACKNOWLEDGEMENT - IPMI Sensor Status on db1098 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Jcrespo Power redundancy lost. Ticket: https://phabricator.wikimedia.org/T246323 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:23:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:26:28] (03PS1) 10Jbond: docker: add docker files to make testing easier [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575208 [09:27:07] (03CR) 10Jbond: [V: 03+2 C: 03+2] docker: add docker files to make testing easier [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575208 (owner: 10Jbond) [09:27:40] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [09:35:51] !log upgrade and restart db1084 T246323 [09:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:58] T246323: db1098 power redundancy lost - https://phabricator.wikimedia.org/T246323 [09:36:50] (03CR) 10Jbond: "> Patch Set 1:" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis) [09:37:19] (03PS1) 10Alexandros Kosiaris: icinga: Fix disc_desired_state mode [puppet] - 10https://gerrit.wikimedia.org/r/575209 [09:38:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] icinga: Fix disc_desired_state mode [puppet] - 10https://gerrit.wikimedia.org/r/575209 (owner: 10Alexandros Kosiaris) [09:41:06] (03CR) 10CDanis: "> Patch Set 1:" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis) [09:49:09] (03PS1) 10Alexandros Kosiaris: discovery: emit Output for the OK case as well [puppet] - 10https://gerrit.wikimedia.org/r/575211 [09:50:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] discovery: emit Output for the OK case as well [puppet] - 10https://gerrit.wikimedia.org/r/575211 (owner: 10Alexandros Kosiaris) [09:52:18] (03PS12) 10Muehlenhoff: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 [09:53:50] (03CR) 10Jbond: style: remove branding (031 comment) 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [09:55:39] (03CR) 10Muehlenhoff: [C: 03+2] Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [09:56:36] (03CR) 10Muehlenhoff: [C: 03+1] style: remove branding (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [10:03:56] (03CR) 10Volans: "Alternative proposal inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/575141 (https://phabricator.wikimedia.org/T235161) (owner: 10Muehlenhoff) [10:04:42] (03PS1) 10Muehlenhoff: Fix netboot.cfg syntax [puppet] - 10https://gerrit.wikimedia.org/r/575212 [10:05:34] (03PS1) 10Muehlenhoff: Update German login dialogue to refer to Wikimedia Developer Name in other i18ns as well [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575213 [10:06:06] (03CR) 10Filippo Giunchedi: [C: 03+1] Fix netboot.cfg syntax [puppet] - 10https://gerrit.wikimedia.org/r/575212 (owner: 10Muehlenhoff) [10:06:40] (03CR) 10Muehlenhoff: [C: 03+2] Fix netboot.cfg syntax [puppet] - 10https://gerrit.wikimedia.org/r/575212 (owner: 10Muehlenhoff) [10:11:26] (03PS1) 10Jbond: templates: add templates base templates used for cas pages [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575214 (https://phabricator.wikimedia.org/T233939) [10:11:48] 10Operations, 10Puppet: Enable strict_hostname_checking on our Puppet nodes - https://phabricator.wikimedia.org/T246327 (10MoritzMuehlenhoff) [10:13:22] 10Operations, 10MediaWiki-General, 10observability: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10Joe) As I repeatedly reiterated, the big issue here is prometheus has a model (pull) that really doesn't work well with the PHP request management model, which is shared-nothing. M... 
[10:14:36] 10Operations, 10MediaWiki-General, 10observability, 10serviceops: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10Joe) [10:17:58] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "smokeping: temp cr2-esams disable" [puppet] - 10https://gerrit.wikimedia.org/r/574742 (https://phabricator.wikimedia.org/T246009) (owner: 10Filippo Giunchedi) [10:19:01] (03PS3) 10Matěj Suchánek: Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573969 [10:19:53] (03PS1) 10Jbond: themes: don't use externally hosted js/css files [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575215 [10:21:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, echoing what Keith mentioned re: DNS patch for cas-logstash" [puppet] - 10https://gerrit.wikimedia.org/r/574499 (owner: 10Muehlenhoff) [10:22:07] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add ES 7 compatible logstash template [puppet] - 10https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [10:22:12] (03PS2) 10Jbond: themes: don't use externally hosted js/css files [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575215 [10:24:05] (03CR) 10Volans: [C: 03+1] "++ to not use external CSS/JS" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575215 (owner: 10Jbond) [10:25:19] (03PS3) 10Jbond: themes: don't use externally hosted js/css files [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575215 (https://phabricator.wikimedia.org/T246010) [10:28:30] (03PS11) 10Muehlenhoff: Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 [10:31:31] (03CR) 10Filippo Giunchedi: logstash, mediawiki: minor fixes in log streaming (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [10:32:24] (03CR) 10Filippo 
Giunchedi: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/574993 (owner: 10Volans) [10:32:39] (03CR) 10Jbond: "lgtm, however looks like it still needs auth from greg" [puppet] - 10https://gerrit.wikimedia.org/r/575101 (https://phabricator.wikimedia.org/T246053) (owner: 10Dzahn) [10:38:26] (03CR) 10Jbond: [C: 03+1] Enable CAS endpoint for Kibana [puppet] - 10https://gerrit.wikimedia.org/r/574499 (owner: 10Muehlenhoff) [10:43:24] (03CR) 10Vgutierrez: [C: 03+2] lvs: Replace lvs2006 with lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/575203 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez) [10:45:36] (03PS1) 10Filippo Giunchedi: swift: use fleetwide uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/575217 (https://phabricator.wikimedia.org/T123918) [10:47:00] 10Operations, 10Puppet: Enable strict_hostname_checking on our Puppet nodes - https://phabricator.wikimedia.org/T246327 (10jbond) I never realised it fell back to the [[ https://puppet.com/docs/puppet/latest/lang_node_definitions.html#matching | fqdn then host + domain facts ]], surprised this hasn't come up... 
[10:48:10] 10Operations, 10netops: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 (10fgiunchedi) +1 to bumping the limit, although the snippet above has `20` not `200` as the limit for pybal if I'm reading correctly [10:54:30] !log replacing lvs2006 with lvs2010 - T196560 T245984 [10:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:37] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [10:54:37] T196560: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 [10:55:29] (03CR) 10Muehlenhoff: [C: 03+2] Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff) [10:57:26] (03PS1) 10Jbond: puppetmaster: enable strict_hostname_checking[1] [puppet] - 10https://gerrit.wikimedia.org/r/575220 (https://phabricator.wikimedia.org/T246327) [10:58:06] (03CR) 10Jcrespo: [C: 03+1] "Caught up, will pool with low load."
[puppet] - 10https://gerrit.wikimedia.org/r/575130 (owner: 10Marostegui) [10:58:16] (03PS2) 10Jcrespo: Revert "db1084: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/575130 (owner: 10Marostegui) [10:58:44] !log stop pybal on lvs2003 to let lvs2010 take the traffic for a little bit - T196560 T245984 [10:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:55] !log start pybal on lvs2003 - T196560 T245984 [11:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:02] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [11:03:03] T196560: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 [11:03:04] (03CR) 10Jcrespo: [C: 03+2] Revert "db1084: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/575130 (owner: 10Marostegui) [11:03:56] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update German login dialogue to refer to Wikimedia Developer Name in other i18ns as well [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575213 (owner: 10Muehlenhoff) [11:09:07] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Thanks!" [labs/private] - 10https://gerrit.wikimedia.org/r/574806 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [11:13:28] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (10Vgutierrez) [11:14:25] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (10Vgutierrez) a:03Vgutierrez [11:16:15] (03PS1) 10Raimond Spekking: Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575221 (https://phabricator.wikimedia.org/T246330) [11:23:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/575220 (https://phabricator.wikimedia.org/T246327) (owner: 10Jbond) [11:25:16] (03CR) 10Muehlenhoff: swift: use fleetwide uid/gid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575217 (https://phabricator.wikimedia.org/T123918) (owner: 10Filippo Giunchedi) [11:27:22] (03PS1) 10Vgutierrez: lvs: Decomm lvs2006 [puppet] - 10https://gerrit.wikimedia.org/r/575222 (https://phabricator.wikimedia.org/T246329) [11:31:08] (03PS4) 10Giuseppe Lavagetto: mediawiki::common: use envoy for tls termination too in nodes using it [puppet] - 10https://gerrit.wikimedia.org/r/574988 (https://phabricator.wikimedia.org/T244843) [11:32:06] (03PS2) 10Vgutierrez: lvs: Decomm lvs2006 [puppet] - 10https://gerrit.wikimedia.org/r/575222 (https://phabricator.wikimedia.org/T246329) [11:35:21] !log pause item migration script at Q50 million T219123 [11:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:26] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [11:35:59] (03PS3) 10Vgutierrez: lvs: Decomm lvs2006 [puppet] - 10https://gerrit.wikimedia.org/r/575222 (https://phabricator.wikimedia.org/T246329) [11:39:26] 10Operations, 10Patch-For-Review, 10User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (10jbond) example of https://gerrit.wikimedia.org/r/c/operations/software/cas-overlay-template/+/575118 {F31646710} [11:40:14] (03PS5) 10Giuseppe Lavagetto: mediawiki::common: use envoy for tls termination too in nodes using it [puppet] - 10https://gerrit.wikimedia.org/r/574988 (https://phabricator.wikimedia.org/T244843) [11:40:36] (03CR) 10Vgutierrez: [C: 03+2] "pcc looks sane: https://puppet-compiler.wmflabs.org/compiler1003/21113/" [puppet] - 10https://gerrit.wikimedia.org/r/575222 (https://phabricator.wikimedia.org/T246329) (owner: 10Vgutierrez) [11:40:57] (03PS2) 10Jbond: style: add Wikimedia Foundation logo 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis) [11:43:15] (03CR) 10Jbond: "> Patch Set 1:" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis) [11:45:44] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1084 at 10% T245621', diff saved to https://phabricator.wikimedia.org/P10538 and previous config saved to /var/cache/conftool/dbconfig/20200227-114542-jynus.json [11:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:51] T245621: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 [11:47:08] !log vgutierrez@cumin2001 START - Cookbook sre.hosts.decommission [11:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:47] !log vgutierrez@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [11:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:53] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin2001 for hosts: `lvs2006.codfw.wmnet` - lvs2006.codfw.wmnet (**PASS**) - Downtime... [11:48:05] !log run decommision script against lvs2006.codfw.wmnet - T246329 [11:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:10] T246329: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 [11:48:36] volans: ^^ logging the decomm script without the FQDN is actually... futile [11:48:52] I know I know... 
[11:48:58] :'( [11:55:11] (03PS1) 10Vgutierrez: Remove lvs2006 production entries [dns] - 10https://gerrit.wikimedia.org/r/575223 (https://phabricator.wikimedia.org/T246329) [11:56:50] (03CR) 10Vgutierrez: [C: 03+2] Remove lvs2006 production entries [dns] - 10https://gerrit.wikimedia.org/r/575223 (https://phabricator.wikimedia.org/T246329) (owner: 10Vgutierrez) [11:57:40] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10Volans) The host has been down a week, hence it has been removed from PuppetDB and the Netbox report caught it. Updated Netbox setting its state to Failed. Please follow... [11:58:31] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (10Vgutierrez) a:05Vgutierrez→03Papaul [11:59:38] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @Papaul lvs2006 is all yours, I've filed T246329 [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1200). [12:00:05] No GERRIT patches in the queue for this window AFAICS.
[12:00:51] 10Operations, 10Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [12:01:49] * Urbanecm steals SWAT [12:01:55] (03CR) 10Urbanecm: [C: 03+2] Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575221 (https://phabricator.wikimedia.org/T246330) (owner: 10Raimond Spekking) [12:02:56] (03Merged) 10jenkins-bot: Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575221 (https://phabricator.wikimedia.org/T246330) (owner: 10Raimond Spekking) [12:05:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: daee105: Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons (T246330) (duration: 01m 05s) [12:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:12] T246330: Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T246330 [12:05:41] (03PS1) 10Vgutierrez: lvs: Replace lvs2003 with lvs2009 [puppet] - 10https://gerrit.wikimedia.org/r/575224 (https://phabricator.wikimedia.org/T196560) [12:06:33] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: daee105: Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons (T246330; take II) (duration: 01m 04s) [12:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:08] (03PS4) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574020 (https://phabricator.wikimedia.org/T240941) [12:07:26] 10Operations, 10Service-Architecture: Many objects in conftool have pooled=yes, weight=0 - https://phabricator.wikimedia.org/T245594 (10Joe) [12:07:48] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::common: install envoy as a forward proxy everywhere. 
[puppet] - 10https://gerrit.wikimedia.org/r/575225 (https://phabricator.wikimedia.org/T244843) [12:08:07] (03PS5) 10Hnowlan: Admin: Add changeprop namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) [12:08:21] (03CR) 10jerkins-bot: [V: 04-1] Admin: Add changeprop namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [12:08:33] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334 (10Vgutierrez) [12:08:56] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis) [12:09:04] (03PS2) 10Vgutierrez: lvs: Replace lvs2003 with lvs2009 [puppet] - 10https://gerrit.wikimedia.org/r/575224 (https://phabricator.wikimedia.org/T196560) [12:10:55] (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1003/21116/" [puppet] - 10https://gerrit.wikimedia.org/r/575224 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez) [12:11:15] !log EU SWAT done [12:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:30] (03PS6) 10Hnowlan: Admin: Add changeprop namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) [12:13:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] role::mediawiki::common: install envoy as a forward proxy everywhere. 
[puppet] - 10https://gerrit.wikimedia.org/r/575225 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:14:48] !log replace lvs2003 with lvs2009 - T196560 T245984 T246334 [12:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:56] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [12:14:56] T196560: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 [12:14:56] T246334: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334 [12:15:21] (03CR) 10Hnowlan: Admin: Add changeprop namespace (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [12:17:07] jouncebot: now [12:17:07] For the next 0 hour(s) and 42 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1200) [12:17:14] Urbanecm: all done? 
:) [12:17:22] addshore: yes [12:17:30] If so I'm going to try that good old item term config read patch to 6 million again :D [12:17:33] great [12:18:30] !log vgutierrez@cumin2001 START - Cookbook sre.hosts.decommission [12:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:40] !log vgutierrez@cumin2001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) [12:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:57] FFS, that shouldn't log before I confirm it ;P [12:19:24] public blame included :D [12:19:27] !log vgutierrez@cumin2001 START - Cookbook sre.hosts.decommission [12:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:07] !log vgutierrez@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [12:20:07] (03PS1) 10Addshore: Read from the new term store again to Q6M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575226 [12:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:14] (03PS2) 10Addshore: Read from the new term store again to Q6M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575226 [12:20:15] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin2001 for hosts: `lvs2003.codfw.wmnet` - lvs2003.codfw.wmnet (**PASS**) - Downtime... 
[12:20:24] (03CR) 10Addshore: [C: 03+2] Read from the new term store again to Q6M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575226 (owner: 10Addshore) [12:21:22] (03Merged) 10jenkins-bot: Read from the new term store again to Q6M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575226 (owner: 10Addshore) [12:24:06] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q6M (was Q2M) again (T219123) (duration: 01m 45s) [12:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:11] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [12:24:28] 12:24:05 1 hosts had failures restarting php-fpm [12:24:34] Urbanecm: ^^ did you also get this? [12:25:02] addshore: which one? [12:25:06] I don't think so [12:25:07] volans: https://phabricator.wikimedia.org/P10539 [12:25:12] debug :) [12:25:49] effie: might be related to anything ongoing on mwdebug2001? [12:26:21] * addshore is resyncing now anyway to make sure it deployed, will see if it pops up again [12:26:48] (03PS1) 10Vgutierrez: lvs: Decomm lvs2003 [puppet] - 10https://gerrit.wikimedia.org/r/575227 (https://phabricator.wikimedia.org/T246334) [12:26:50] volans: mm no, it should be working ok, I can take a look on mwdebug1001 [12:27:19] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q6M (was Q2M) again (T219123) cachebust? 
(duration: 01m 17s) [12:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:32] ^^ my second sync there had no errors or warnings [12:27:33] oh it is mwdebug2001 [12:30:24] (03CR) 10CDanis: [C: 03+1] style: add Wikimedia Foundation logo [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis) [12:30:26] (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1003/21117/" [puppet] - 10https://gerrit.wikimedia.org/r/575227 (https://phabricator.wikimedia.org/T246334) (owner: 10Vgutierrez) [12:31:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] role::mediawiki::common: install envoy as a forward proxy everywhere. [puppet] - 10https://gerrit.wikimedia.org/r/575225 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:33:25] (03PS1) 10Addshore: Read from the new term store up to Q8 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575231 (https://phabricator.wikimedia.org/T219123) [12:33:40] (03PS1) 10Vgutierrez: Remove lvs2003 production entries [dns] - 10https://gerrit.wikimedia.org/r/575232 (https://phabricator.wikimedia.org/T246334) [12:33:57] (03CR) 10Addshore: [C: 03+2] Read from the new term store up to Q8 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575231 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [12:35:01] (03Merged) 10jenkins-bot: Read from the new term store up to Q8 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575231 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [12:35:03] (03CR) 10Vgutierrez: [C: 03+2] Remove lvs2003 production entries [dns] - 10https://gerrit.wikimedia.org/r/575232 (https://phabricator.wikimedia.org/T246334) (owner: 10Vgutierrez) [12:35:48] (03CR) 10Elukey: [C: 03+2] Add an-launcher1001 to profile::dumps::distribution [puppet] - 10https://gerrit.wikimedia.org/r/575048 
(https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [12:36:31] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q8M (was Q6M) again (T219123) (duration: 01m 04s) [12:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:37] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [12:37:45] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q8M (was Q6M) again (T219123) ?cachebust (duration: 01m 03s) [12:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:06] 10Operations, 10netops: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 (10ayounsi) The syntax is not obvious, `maximum 1000 teardown 20` means shutdown the session at 1000 but start sending warning logs at 20% of the 1000. [12:41:11] !log bump BGP prefix-limit on all routers - T246110 [12:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:16] T246110: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 [12:43:50] PROBLEM - Check systemd state on mw2299 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:22] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334 (10Vgutierrez) a:05Vgutierrez→03Papaul [12:49:03] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @Papaul same for lvs2003: T246334 Regarding lvs2007 and lvs2008, please update the NICs FW to the same versions as you did for lvs2009 and lvs... 
[12:51:09] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [12:51:54] (03CR) 10Jbond: [V: 03+2 C: 03+2] style: add Wikimedia Foundation logo [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis) [12:52:15] 10Operations, 10netops: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 (10ayounsi) 05Open→03Resolved Done. [12:56:26] !log delete specific tcp-mss on cr2-eqiad:equinix (will cause an interface flap) - T244610 [12:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:09] actually no interface flap as there is another "global" one still in effect [12:57:10] (03PS1) 10Addshore: Read from the new term store, back to Q2 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575235 (https://phabricator.wikimedia.org/T219123) [12:59:19] PROBLEM - Check systemd state on mw2258 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:19] PROBLEM - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:01] PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:35] PROBLEM - Check systemd state on mw2147 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:53] PROBLEM - Check systemd state on mw2319 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:53] PROBLEM - Check systemd state on mw2296 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:56] uhm [13:01:11] is that expected? I have not read any backlog [13:01:17] PROBLEM - Check systemd state on mw2183 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:23] PROBLEM - Check systemd state on mw2195 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:25] looking [13:01:27] PROBLEM - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:33] _joe_: ● envoyproxy.service loaded failed failed Envoy proxy [13:01:34] looking as well [13:01:36] seems envoy [13:01:39] yesah [13:01:58] <_joe_> yep [13:02:01] <_joe_> no idea why [13:02:03] permission denied on echorestore.log? [13:02:05] PROBLEM - Check systemd state on mw2152 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:07] <_joe_> it worked on the first two hosts [13:02:07] oh it is the envoy hippy [13:02:09] PROBLEM - Check systemd state on mw2182 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:10] lol [13:02:10] <_joe_> oh sigh yes [13:02:13] am I okay to revert my config change (seems unrelated to those problems) ? :) [13:02:17] _joe_: shall I stop puppet on appservers? 
[13:02:21] PROBLEM - Check systemd state on mw2288 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:27] <_joe_> cdanis: no, it's just spam [13:02:33] <_joe_> don't worry, it will be fixed now [13:02:38] ok [13:02:41] (03CR) 10Addshore: [C: 03+2] Read from the new term store, back to Q2 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575235 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [13:02:51] * addshore takes that as a yes [13:02:52] addshore: +1 [13:02:56] :) ty [13:03:01] PROBLEM - Check systemd state on mw2223 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:01] PROBLEM - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:09] PROBLEM - Check systemd state on mw2280 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:09] PROBLEM - Check systemd state on mw2202 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:16] <_joe_> addshore: yes, go on if you need to revert [13:03:25] PROBLEM - Check systemd state on mw2310 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:26] <_joe_> this is just noise [13:03:34] just noise, white noise [13:03:37] PROBLEM - Check systemd state on mw2137 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:45] PROBLEM - Check systemd state on mw2265 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:46] yeah, second puppet run fixes it [13:03:49] <_joe_> !log re-stopped puppet on codfw [13:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:56] <_joe_> moritzm: it shouldn't [13:03:57] PROBLEM - Check systemd state on mw2287 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:01] PROBLEM - Check systemd state on mw2255 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:07] PROBLEM - Check systemd state on mw2277 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:08] (03Merged) 10jenkins-bot: Read from the new term store, back to Q2 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575235 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [13:04:26] <_joe_> anyways, fixing it [13:04:27] yeah, you're right [13:04:29] PROBLEM - Check systemd state on mw2141 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:42] <_joe_> I have no idea how or why this is happening [13:04:43] PROBLEM - Check systemd state on mw2240 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:26] (03PS1) 10Jbond: ldap properties: add ldap config file to ease local testing [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575236 [13:05:29] PROBLEM - Check systemd state on mw2252 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:35] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q2M (was Q8M) again (T219123) (duration: 01m 03s) [13:05:39] PROBLEM - Check systemd state on mw2256 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:41] PROBLEM - Check systemd state on mw2274 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:47] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [13:06:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] templates: add templates base templates used for cas pages [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575214 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [13:06:11] (03CR) 10Jbond: [V: 03+2 C: 03+2] themes: don't use externally hosted js/css files [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575215 (https://phabricator.wikimedia.org/T246010) (owner: 10Jbond) [13:06:13] PROBLEM - Check systemd state on mw2201 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:15] PROBLEM - Check systemd state on mw2225 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:21] (03CR) 10Jbond: [V: 03+2 C: 03+2] ldap properties: add ldap config file to ease local testing [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575236 (owner: 10Jbond) [13:06:31] PROBLEM - Check systemd state on mw2315 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:35] PROBLEM - Check systemd state on mw2253 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:35] PROBLEM - Check systemd state on mw2231 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:48] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q2M (was Q8M) again (T219123) ?cachebust (duration: 01m 03s) [13:06:49] RECOVERY - Check systemd state on mw2310 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:52] thats me done [13:07:01] RECOVERY - Check systemd state on mw2137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:04] <_joe_> !log restarting envoy, after chowning the log files, on all codfw mw servers where it was installed [13:07:07] RECOVERY - Check systemd state on mw2152 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:07] RECOVERY - Check systemd state on mw2147 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:11] RECOVERY - Check systemd state on mw2265 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:11] RECOVERY - Check systemd state on mw2252 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:13] RECOVERY - Check systemd state on mw2182 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:21] RECOVERY - Check systemd state on mw2256 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:21] RECOVERY - Check systemd 
state on mw2258 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:23] RECOVERY - Check systemd state on mw2154 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:23] RECOVERY - Check systemd state on mw2274 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:27] RECOVERY - Check systemd state on mw2287 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:27] RECOVERY - Check systemd state on mw2288 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:29] RECOVERY - Check systemd state on mw2255 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:31] RECOVERY - Check systemd state on mw2319 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:31] RECOVERY - Check systemd state on mw2296 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:37] RECOVERY - Check systemd state on mw2277 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:01] RECOVERY - Check systemd state on mw2141 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:01] RECOVERY - Check systemd state on mw2183 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:01] RECOVERY - Check systemd state on mw2201 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:03] RECOVERY - Check systemd state on mw2225 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:07] RECOVERY - Check systemd state on mw2195 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:11] RECOVERY - Check systemd state on mw2223 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:11] RECOVERY - Check systemd state on mw2157 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:13] RECOVERY - Check systemd state on mw2248 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:15] RECOVERY - Check systemd state on mw2240 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:17] RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:19] RECOVERY - Check systemd state on mw2315 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:19] RECOVERY - Check systemd state on mw2280 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:19] RECOVERY - Check systemd state on mw2202 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:23] RECOVERY - Check systemd state on mw2253 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:23] RECOVERY - Check systemd 
state on mw2231 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:39] (03PS1) 10Ayounsi: esams/knams: remove prepending and tcp-mss clamping [homer/public] - 10https://gerrit.wikimedia.org/r/575237 [13:09:34] (03CR) 10CDanis: [C: 03+1] esams/knams: remove prepending and tcp-mss clamping [homer/public] - 10https://gerrit.wikimedia.org/r/575237 (owner: 10Ayounsi) [13:09:46] (03CR) 10Ayounsi: [C: 03+2] esams/knams: remove prepending and tcp-mss clamping [homer/public] - 10https://gerrit.wikimedia.org/r/575237 (owner: 10Ayounsi) [13:10:03] (03Merged) 10jenkins-bot: esams/knams: remove prepending and tcp-mss clamping [homer/public] - 10https://gerrit.wikimedia.org/r/575237 (owner: 10Ayounsi) [13:11:55] !log esams/knams rollback tcp-mss camping and prepending [13:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:11] !log s/camping/clamping/ [13:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:49] RECOVERY - Check systemd state on mw2299 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:37] (03PS2) 10Filippo Giunchedi: swift: use fleetwide uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/575217 (https://phabricator.wikimedia.org/T123918) [13:24:44] (03CR) 10Filippo Giunchedi: swift: use fleetwide uid/gid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575217 (https://phabricator.wikimedia.org/T123918) (owner: 10Filippo Giunchedi) [13:28:37] <_joe_> !log installing envoy in eqiad too [13:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/575217 (https://phabricator.wikimedia.org/T123918) (owner: 10Filippo Giunchedi) [13:34:50] (03CR) 10Muehlenhoff: "@Keith, Filippo: Yes, this isn't complete yet, there 
will a additional followup commits for Varnish and DNS as well." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574499 (owner: 10Muehlenhoff) [13:35:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:35:30] 10Operations, 10netops: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (10ayounsi) p:05Triage→03Medium [13:36:31] <_joe_> uh [13:36:43] <_joe_> godog: ^^ can be related to my changes? [13:36:45] (03PS4) 10Muehlenhoff: Enable CAS endpoint for Kibana [puppet] - 10https://gerrit.wikimedia.org/r/574499 [13:37:21] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:38:03] <_joe_> ahem [13:38:05] _joe_: mhh I'm not sure, that means icinga_exporter couldn't be queried in time [13:38:18] <_joe_> oh ok icinga_exporter [13:38:53] thinking out loud, icinga restarts shouldn't affect it either [13:48:46] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: use fleetwide uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/575217 (https://phabricator.wikimedia.org/T123918) (owner: 10Filippo Giunchedi) [13:52:36] 10Operations, 10SRE-swift-storage, 10Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10fgiunchedi) [13:58:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of typos, otherwise LGTM." 
(033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [13:59:38] (03PS1) 10Elukey: cdh::hive: improve jar file match regex to work with BigTop [puppet] - 10https://gerrit.wikimedia.org/r/575242 (https://phabricator.wikimedia.org/T244499) [14:02:21] (03CR) 10Elukey: [C: 03+2] cdh::hive: improve jar file match regex to work with BigTop [puppet] - 10https://gerrit.wikimedia.org/r/575242 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [14:03:26] jouncebot: next [14:03:26] In 2 hour(s) and 56 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1700) [14:03:30] jouncebot: now [14:03:30] No deployments scheduled for the next 2 hour(s) and 56 minute(s) [14:04:56] (03CR) 10Gilles: "The Thumbor configuration for tests is different than the configuration of Thumbor as installed by the Debian packages." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) (owner: 10Brion VIBBER) [14:05:24] (03PS1) 10Urbanecm: Increase arwiki's WikiGap throttle lift to 400 accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575243 (https://phabricator.wikimedia.org/T246092) [14:05:34] (03CR) 10Gilles: "*if you made a mistake" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) (owner: 10Brion VIBBER) [14:05:36] (03CR) 10Urbanecm: [C: 03+2] Increase arwiki's WikiGap throttle lift to 400 accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575243 (https://phabricator.wikimedia.org/T246092) (owner: 10Urbanecm) [14:06:34] (03Merged) 10jenkins-bot: Increase arwiki's WikiGap throttle lift to 400 accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575243 (https://phabricator.wikimedia.org/T246092) (owner: 10Urbanecm) [14:07:37] (03PS1) 10Giuseppe Lavagetto: role::parsoid: base it on 
role::mediawiki::common [puppet] - 10https://gerrit.wikimedia.org/r/575244 [14:08:25] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: 7e3a57a: Increase arwiki WikiGap throttle lift to 400 accounts (T246092) (duration: 01m 05s) [14:09:31] where are you, stashbot ? [14:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:18] T246092: Temporary lift IP cap for WikiGap edit-a-thon at Khawarizmi College in 5 March 2020 - https://phabricator.wikimedia.org/T246092 [14:11:12] (03PS2) 10Giuseppe Lavagetto: role::parsoid: base it on role::mediawiki::common [puppet] - 10https://gerrit.wikimedia.org/r/575244 [14:17:14] 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10Gehel) [14:20:23] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10bd808) >>! In T244986#5922092, @wiki_willy wrote: > @aborrero (and @Jclark-ctr for visibility) - it looks this was purchased back in 2014, and past the 5yr ser... 
[14:20:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/21120/" [puppet] - 10https://gerrit.wikimedia.org/r/575244 (owner: 10Giuseppe Lavagetto) [14:25:35] (03PS1) 10Vgutierrez: install_server: Reimage lvs4007 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575246 (https://phabricator.wikimedia.org/T245984) [14:26:17] (03CR) 10Muehlenhoff: [C: 03+2] Enable CAS endpoint for Kibana [puppet] - 10https://gerrit.wikimedia.org/r/574499 (owner: 10Muehlenhoff) [14:26:37] (03CR) 10BryanDavis: [C: 03+2] d/changelog: prepare 0.64 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575111 (owner: 10BryanDavis) [14:27:06] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage lvs4007 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575246 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [14:29:46] (03Merged) 10jenkins-bot: d/changelog: prepare 0.64 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/575111 (owner: 10BryanDavis) [14:33:30] (03CR) 10Ottomata: "I (obviously) think my approach using main_app.name as the 'main app' identifier is a good one. I think Alex does too. There might be so" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [14:33:52] (03PS7) 10Hnowlan: Admin: Add changeprop namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) [14:35:20] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs4007.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... 
[14:35:50] !log reimage lvs4007 with buster - T245984 [14:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:56] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [14:37:37] (03CR) 10Ottomata: "BTW, the stuff I did for eventgate chart does change some of the conventions we've been using already. I think the phab ticket Petr linke" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [14:38:22] (03CR) 10Hnowlan: Admin: Add changeprop namespace (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [14:49:57] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:50:57] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:53:29] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [14:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:52] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:03] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1084 at 50% T245621', diff saved to https://phabricator.wikimedia.org/P10542 and previous config saved to /var/cache/conftool/dbconfig/20200227-150302-jynus.json [15:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:10] T245621: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 [15:03:37] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage 
of hosts: ` ['lvs4007.ulsfo.wmnet'] ` and were **ALL** successful. [15:06:39] (03CR) 10Alexandros Kosiaris: "I tend to agree with Petr, this approach feels wrong. Having a flag to switch from cpjobqueue to changeprop essentially says that we can't" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [15:07:28] (03CR) 10Jhedden: [C: 03+2] toolforge: upgrade elasticsearch and add debian buster support [puppet] - 10https://gerrit.wikimedia.org/r/574527 (https://phabricator.wikimedia.org/T236606) (owner: 10Jhedden) [15:10:33] (03PS1) 10Vgutierrez: lvs: Reimage lvs4006 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575256 (https://phabricator.wikimedia.org/T245984) [15:11:10] (03PS2) 10Vgutierrez: install_server,lvs: Reimage lvs4006 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575256 (https://phabricator.wikimedia.org/T245984) [15:12:48] (03CR) 10Vgutierrez: [C: 03+2] install_server,lvs: Reimage lvs4006 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575256 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [15:13:00] (03PS3) 10Vgutierrez: install_server,lvs: Reimage lvs4006 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575256 (https://phabricator.wikimedia.org/T245984) [15:15:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add discovery for eventgate-analytics-external [puppet] - 10https://gerrit.wikimedia.org/r/573366 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [15:15:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] Route intake-analytics.wm.org to eventgate-analytics-external [puppet] - 10https://gerrit.wikimedia.org/r/573369 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [15:16:33] (03PS1) 10Elukey: role::search::airflow: allow analytics-admins to ssh to hosts [puppet] - 10https://gerrit.wikimedia.org/r/575260 [15:16:37] 10Operations, 10DBA: db1084 crashed due to BBU failure - 
https://phabricator.wikimedia.org/T245621 (10jcrespo) I will let @Marostegui put it back to 100% and do the full revert and finishing touches + resolv. [15:17:49] !log reimage lvs4006 with buster - T245984 [15:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:56] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs4006.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... [15:17:56] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [15:18:38] (03CR) 10Elukey: [C: 03+2] role::search::airflow: allow analytics-admins to ssh to hosts [puppet] - 10https://gerrit.wikimedia.org/r/575260 (owner: 10Elukey) [15:19:46] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm -- I'd like Krenair to confirm that there aren't any <4 puppetmasters still living in cloud-vps" [puppet] - 10https://gerrit.wikimedia.org/r/575220 (https://phabricator.wikimedia.org/T246327) (owner: 10Jbond) [15:22:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Question: nothing really contacts change-prop via HTTP, except maybe service-checker that does a simple health check. 
Do we even want to" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574811 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [15:23:52] !log installing curl security updates on stretch/buster [15:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:20] (03CR) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [15:24:22] (03PS3) 10Hnowlan: changeprop: add hierdata k8s entries [puppet] - 10https://gerrit.wikimedia.org/r/574811 (https://phabricator.wikimedia.org/T213193) [15:24:43] (03CR) 10Hnowlan: changeprop: add hierdata k8s entries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574811 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [15:26:09] (03CR) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [15:28:54] (03PS7) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) [15:29:10] !log restarting mw canaries to pick up curl update [15:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:08] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:31:18] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.21/extensions/ConfirmEdit/includes/auth/CaptchaPreAuthenticationProvider.php: T245280 (duration: 01m 05s) [15:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:24] T245280: 
logstash_formatter_key_conflict in mediawiki logs - https://phabricator.wikimedia.org/T245280 [15:32:40] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:32:49] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.20/extensions/ConfirmEdit/includes/auth/CaptchaPreAuthenticationProvider.php: T245280 (duration: 01m 04s) [15:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A few comments, but overall this looks pretty close to being ready. Nice!" (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [15:33:56] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:35:02] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:24] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [15:37:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:37:22] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:37:41] (03PS1) 10Addshore: Read from the new term store up to Q4 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575264 (https://phabricator.wikimedia.org/T219123) [15:37:46] jouncebot: now [15:37:47] No deployments scheduled for the next 1 hour(s) and 22 minute(s) [15:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:12] (03PS1) 10Elukey: profile::analytics::search::airflow: fix group require [puppet] - 10https://gerrit.wikimedia.org/r/575265 [15:38:22] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:39:56] (03CR) 10Elukey: "Puppet has been broken for a long time due to this bug, let's check it when doing changes :)" [puppet] - 10https://gerrit.wikimedia.org/r/575265 (owner: 10Elukey) [15:40:43] (03CR) 10Elukey: [C: 03+2] profile::analytics::search::airflow: fix group require [puppet] - 10https://gerrit.wikimedia.org/r/575265 (owner: 10Elukey) [15:41:03] (03PS1) 10Jhedden: toolforge: add prometheus exporter for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/575266 [15:44:12] (03CR) 10Jhedden: [C: 03+2] toolforge: add prometheus exporter for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/575266 (owner: 10Jhedden) [15:44:42] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs4006.ulsfo.wmnet'] ` and were **ALL** successful. 
[15:45:27] 10Operations, 10ops-eqiad, 10DC-Ops: (Due by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10RobH) [15:45:48] 10Operations, 10ops-eqiad, 10DC-Ops: (Due by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10RobH) [15:46:02] 10Operations, 10ops-eqiad, 10DC-Ops: (Due by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10RobH) [15:46:56] (03PS2) 10Effie Mouzeli: thumbor: remove nginx code leftovers [puppet] - 10https://gerrit.wikimedia.org/r/572033 [15:50:24] (03PS1) 10Giuseppe Lavagetto: ProductionServices: switch search to use envoy instead of nginx [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575268 (https://phabricator.wikimedia.org/T244843) [15:50:28] (03PS1) 10Giuseppe Lavagetto: ProductionServices: use local http proxy for parsoid, parsoidphp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575269 (https://phabricator.wikimedia.org/T244843) [15:50:30] (03PS1) 10Giuseppe Lavagetto: ProductionServices: use the local proxy for sessionstore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575270 (https://phabricator.wikimedia.org/T244843) [15:50:37] (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Looks ok, except the fact we're still not specifying where to connect to Redis. 
the secrets will get us the redis path, but the redis ur" (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [15:52:35] !log installing python-pysaml security updates [15:52:36] (03PS8) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) [15:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:10] (03PS1) 10Andrew Bogott: keystone hooks: create .wmcloud.org project domain during project creation [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174) [15:54:12] (03CR) 10Effie Mouzeli: "PCC for mwdebug https://puppet-compiler.wmflabs.org/compiler1001/21122/mwdebug1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [15:54:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] ProductionServices: switch search to use envoy instead of nginx [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575268 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:55:57] (03CR) 10jerkins-bot: [V: 04-1] keystone hooks: create .wmcloud.org project domain during project creation [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174) (owner: 10Andrew Bogott) [15:56:02] !log installing python-django updates (packaged Debian version) [15:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:54] it seems a calmer moment, I'll merge the icinga patch that should be noop [15:57:02] (03CR) 10Volans: [C: 03+2] icinga: fix use of stale unpuppetized check files [puppet] - 10https://gerrit.wikimedia.org/r/574993 (owner: 10Volans) [15:58:39] (03PS2) 10Andrew Bogott: keystone hooks: create .wmcloud.org project domain during project creation [puppet] - 
10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174) [15:59:00] (03CR) 10Addshore: [C: 03+2] Read from the new term store up to Q4 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575264 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [15:59:12] take 50 [15:59:16] * addshore lost count [15:59:45] !log installing e2fsck security updates on buster [15:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:07] (03Merged) 10jenkins-bot: Read from the new term store up to Q4 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575264 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [16:00:10] (03PS3) 10Andrew Bogott: keystone hooks: create .wmcloud.org project domain during project creation [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174) [16:02:30] !log disable puppet on thumbor* [16:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:08] 10Operations: Integrate Buster 10.3 point update - https://phabricator.wikimedia.org/T244693 (10MoritzMuehlenhoff) [16:03:46] (03PS1) 10Vgutierrez: lvs: Re-enable BGP in lvs4006 [puppet] - 10https://gerrit.wikimedia.org/r/575274 (https://phabricator.wikimedia.org/T245984) [16:05:24] !log installing python3.7 security updates on Buster [16:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:50] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q4M for the new term store for clients (was Q2M) + warm db1126 caches (T219123) (duration: 01m 04s) [16:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:57] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [16:07:26] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q4M for the new term store for 
clients (was Q2M) + warm db1126 caches (T219123) cache bust (duration: 01m 04s) [16:07:27] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2007.codfw.wmnet ` The log can be fou... [16:07:33] (03CR) 10Vgutierrez: [C: 03+2] lvs: Re-enable BGP in lvs4006 [puppet] - 10https://gerrit.wikimedia.org/r/575274 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [16:08:00] !log begin warming wikidata term cache on db1126 for Q4-6 million T219123 [16:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:57] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2008.codfw.wmnet ` The log can be fou... 
[16:09:10] !log re-enable BGP in lvs4006 - T245984 [16:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:35] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [16:10:45] !log mwscript extensions/AbuseFilter/maintenance/fixOldLogEntries.php --wiki=mediawikiwiki --verbose (T228655) [16:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:51] T228655: Dry-run fixOldLogEntries for AbuseFilter - https://phabricator.wikimedia.org/T228655 [16:11:47] (03PS4) 10Andrew Bogott: keystone hooks: create .wmcloud.org project domain during project creation [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174) [16:11:53] !log foreachwiki extensions/AbuseFilter/maintenance/fixOldLogEntries.php --verbose started (T228655) [16:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:03] (03PS1) 10Vgutierrez: install_server,lvs: Reimage lvs4005 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575277 (https://phabricator.wikimedia.org/T245984) [16:12:52] !log rebooting parse2009 to clear memory error [16:12:53] (03CR) 10Holger Knust: "Redis is defined in the defaults and to keep it consistent with the other KVs, I overrode only the non-default items. 
Still add to the ind" [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [16:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:30] (03CR) 10Vgutierrez: [C: 03+2] install_server,lvs: Reimage lvs4005 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575277 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [16:14:49] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: remove nginx code leftovers [puppet] - 10https://gerrit.wikimedia.org/r/572033 (owner: 10Effie Mouzeli) [16:15:19] effie: may I merge that? [16:15:57] PROBLEM - Host parse2009 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:17] effie: :? :) [16:16:45] RECOVERY - Host parse2009 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [16:18:45] (03PS5) 10Andrew Bogott: keystone hooks: create new default domains for new projects [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174) [16:20:36] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs4005.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... 
[16:20:38] !log reimage lvs4005 with buster - T245984 [16:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:44] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [16:21:45] !log installing wget security updates on jessie [16:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:03] (03CR) 10Dzahn: [C: 03+2] site: add new parsoid nodes with spare role [puppet] - 10https://gerrit.wikimedia.org/r/575100 (https://phabricator.wikimedia.org/T243112) (owner: 10Dzahn) [16:22:18] (03PS4) 10Dzahn: site: add new parsoid nodes with spare role [puppet] - 10https://gerrit.wikimedia.org/r/575100 (https://phabricator.wikimedia.org/T243112) [16:22:24] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:08] 10Operations: Integrate Buster 10.3 point update - https://phabricator.wikimedia.org/T244693 (10MoritzMuehlenhoff) [16:23:53] no backups on apt1001 [16:23:55] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:05] moritzm: ^yours, new host? [16:24:10] jynus: new host [16:24:12] (03PS6) 10Andrew Bogott: keystone hooks: create new default domains for new projects [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174) [16:24:18] just got the role the other day [16:24:21] cool, then no issue- only a warning [16:24:25] jynus: yeah, those are replacing install* [16:24:30] cool, it does have backup::host class [16:24:40] let me know if interested to do a manual run [16:24:46] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:24:48] when it has something meaning full to test it [16:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:59] the data was rsynced from install1002 .. 
so we already have a backup of that [16:24:59] otherwise it will be done automatically at the beginning of the month [16:25:06] i think March 1st is enough [16:25:41] cool, just announcing: feel free to ask me for any operations in the future [16:26:10] especially when moving hosts, it is super-easy to do a custom run [16:26:24] it is literally just executing "run" :-D [16:26:26] 10Operations: Log the real X-Client-IP - https://phabricator.wikimedia.org/T246348 (10Reedy) [16:27:10] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:32] jynus: thank you, sounds good [16:29:31] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2007.codfw.wmnet'] ` and were **ALL** successful. [16:31:55] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2008.codfw.wmnet'] ` and were **ALL** successful. 
[16:34:47] (03PS10) 10Bstorm: labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832 [16:36:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [16:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:09] (03CR) 10Bstorm: [C: 03+2] labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832 (owner: 10Bstorm) [16:39:02] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_cxserver_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:40:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:43:24] PROBLEM - Host lvs4005 is DOWN: PING CRITICAL - Packet loss = 100% [16:45:22] RECOVERY - Host lvs4005 is UP: PING OK - Packet loss = 0%, RTA = 74.65 ms [16:45:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "Good enough is good enough™" [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [16:46:45] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs4005.ulsfo.wmnet'] ` and were **ALL** successful. 
[16:48:04] (03PS1) 10Vgutierrez: lvs: Re-enable BGP in lvs4005 [puppet] - 10https://gerrit.wikimedia.org/r/575295 (https://phabricator.wikimedia.org/T245984) [16:49:10] !log END warming wikidata term cache on db1126 for Q4-6 million T219123 (pass1) [16:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:15] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [16:49:22] !log START warming wikidata term cache on db1126 for Q4-6 million T219123 (pass2) [16:49:24] (03PS2) 10SBassett: Deployment group audit [puppet] - 10https://gerrit.wikimedia.org/r/574869 (https://phabricator.wikimedia.org/T237696) [16:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:42] !log temporarily decommented external check for icinga2001. Restarting Icinga on icinga2001 [16:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:02] (03PS1) 10Bstorm: labstore: finish setting up the firewall on the old primary cluster [puppet] - 10https://gerrit.wikimedia.org/r/575296 (https://phabricator.wikimedia.org/T165136) [16:52:28] (03CR) 10Vgutierrez: [C: 03+2] lvs: Re-enable BGP in lvs4005 [puppet] - 10https://gerrit.wikimedia.org/r/575295 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [16:53:25] (03PS6) 10Krinkle: Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz) [16:55:02] !log re-enable BGP in lvs4005 - T245984 [16:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:08] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [16:55:10] (03CR) 10Krinkle: [C: 03+2] Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz) [16:56:02] (03CR) 10Bstorm: [C: 03+2] labstore: finish setting up the firewall on the 
old primary cluster [puppet] - 10https://gerrit.wikimedia.org/r/575296 (https://phabricator.wikimedia.org/T165136) (owner: 10Bstorm) [16:56:14] (03Merged) 10jenkins-bot: Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz) [16:56:47] (03PS3) 10SBassett: Deployment group audit [puppet] - 10https://gerrit.wikimedia.org/r/574869 (https://phabricator.wikimedia.org/T237696) [16:57:44] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: First Paint desktop, First Paint mobile, INM Satisfaction Ratio, Load Event End overall, Response Start desktop, Response Start mobile, Varnish frontend hit rate. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [16:58:38] the failing grafana checks are known, patch incoming [16:59:48] !log Disabled new account creation on wikitech via horrible TitleBlacklist hack. [17:00:04] godog and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:03:31] !log reimage lvs5003 with buster - T245984 [17:03:42] oh no wikibugs is gone :( [17:04:08] hmm stashbot as well? 
[17:04:21] yup :_( [17:04:38] true :( [17:04:56] now nobody cares what I write here [17:04:58] :_( [17:05:26] !log krinkle@deploy1001 Synchronized wmf-config/mc.php: I119aff6312463 - allow_tcp_nagle_delay:off (duration: 01m 05s) [17:05:34] PROBLEM - Host lvs5003 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:45] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1087 at 10% T232446', diff saved to https://phabricator.wikimedia.org/P10546 and previous config saved to /var/cache/conftool/dbconfig/20200227-170543-jynus.json [17:06:36] RECOVERY - Host lvs5003 is UP: PING OK - Packet loss = 0%, RTA = 231.33 ms [17:06:41] uh... [17:06:48] lvs5003 should be downtimed by the reimage script [17:07:48] doing it manually... [17:08:47] someone should probably re!log those logmsgbot things once stashbot is back? [17:09:24] (I’m leaving soon so I probably can’t do it myself) [17:11:01] grafana alerts should be recovering soon [17:11:46] !log END warming wikidata term cache on db1126 for Q4-6 million T219123 (pass2) [17:11:54] jouncebot: now [17:11:54] For the next 0 hour(s) and 48 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1700) [17:14:26] are we having a netsplit or something that I missed? or are all the bots just dead? [17:14:42] wikitech technical issues [17:15:19] hmm wikibugs also gone, I'm guessing that is related? 
[17:15:28] it is, yeah [17:15:42] * addshore was just about to move from Q4 million to Q6 million for wikidata item term reads on clients [17:16:03] I think we can go on [17:16:05] :) [17:16:19] * addshore announces doing a thing in mediawiki-config [17:16:20] we haven't lost observability [17:16:39] for reference my thingy is https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/575301/ [17:17:38] (03Merged) 10jenkins-bot: Read from the new term store up to Q6 million for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575301 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [17:18:07] :) [17:18:27] !log (relog FROM 5:11) END warming wikidata term cache on db1126 for Q4-6 million T219123 (pass2) [17:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:33] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [17:18:43] 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10Papaul) |Servers|NIC1|NIC2|NIC3|NIC4|Note| |lvs2007| |lvs2008|asw-b2 xe-2/0/45|'A7': xe-7/0/45|C2': xe-2/0/45|D2': xe-2/0/46| using the same cables lvs2006 was using... 
[17:19:04] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q6M for the new term store for clients (was Q4M) + warm db1126 caches (T219123) (duration: 01m 04s) [17:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:10] 10Operations, 10Traffic, 10Patch-For-Review: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 (10Vgutierrez) 05Stalled→03Resolved a:03Vgutierrez [17:19:12] 10Operations, 10Acme-chief, 10Traffic: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (10Vgutierrez) [17:19:43] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10wiki_willy) @bd808 - thanks for providing the background context around these. I hit up Rob to prioritize T243471 more. (quotes being submitted soon) Also, w... 
[17:20:20] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q6M for the new term store for clients (was Q4M) + warm db1126 caches (T219123) cache bust (duration: 01m 04s) [17:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:51] (03CR) 10BBlack: [C: 03+2] wikiworkshop.org: switch DNS to our text endpoint [dns] - 10https://gerrit.wikimedia.org/r/575304 (https://phabricator.wikimedia.org/T242374) (owner: 10BBlack) [17:21:00] (03PS1) 10Vgutierrez: ATS: Switch unified cert vendor to Let's Encrypt on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/575305 (https://phabricator.wikimedia.org/T230687) [17:24:13] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:24:22] (03CR) 10Herron: [C: 03+2] "LGTM -- aiui the ruby clientip shuffling is expected to be temporary until the apache logs are more consistently formatted" [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [17:24:27] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10wiki_willy) [17:24:58] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10wiki_willy) T246365 created for ordering the replacement drive. 
Thanks, Willy [17:25:45] RECOVERY - Check whether ferm is active by checking the default input chain on labstore1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:26:30] (03CR) 10Vgutierrez: "pcc looks sane: https://puppet-compiler.wmflabs.org/compiler1002/21130/" [puppet] - 10https://gerrit.wikimedia.org/r/575305 (https://phabricator.wikimedia.org/T230687) (owner: 10Vgutierrez) [17:27:19] (03CR) 10Vgutierrez: [C: 04-2] "merge on Monday :)" [puppet] - 10https://gerrit.wikimedia.org/r/575305 (https://phabricator.wikimedia.org/T230687) (owner: 10Vgutierrez) [17:29:37] (03PS1) 10Bstorm: labstore: one more nfs ferm fix for the primary cluster [puppet] - 10https://gerrit.wikimedia.org/r/575307 [17:30:17] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1087 at 20% T232446', diff saved to https://phabricator.wikimedia.org/P10547 and previous config saved to /var/cache/conftool/dbconfig/20200227-173017-jynus.json [17:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:23] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [17:30:43] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [17:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:25] !log (from 17:03) reimage lvs5003 with buster - T245984 [17:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:30] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [17:31:34] !log START warming wikidata term cache on db1126 for Q6-8 million T219123 (pass1) [17:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:39] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [17:31:42] addshore: 05:11 != 17:11 ;P [17:31:53] vgutierrez: i realized that once I sent ti >.> [17:32:14] addshore: so happy so far? 
[17:32:25] jynus: 6 million seems to be behaving [17:33:00] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:18] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul) [17:33:27] jynus: the one thing I still see that makes me think it might be a contributing factor is the sending data state of some processes https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-6h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&refresh=30s&fullscreen&panelId=37 [17:33:52] 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul) 05Open→03Resolved @Dzahn @joe all 20 servers ready for service [17:34:02] sending that is a bit meaningless, like idle [17:34:08] it means "it is doing something" [17:34:10] (03CR) 10Bstorm: [C: 03+2] labstore: one more nfs ferm fix for the primary cluster [puppet] - 10https://gerrit.wikimedia.org/r/575307 (owner: 10Bstorm) [17:34:13] jynus: okay :P [17:34:28] will correlate with spikes [17:34:45] !log volans@cumin1001 START - Cookbook sre.hosts.downtime [17:34:46] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:52] that graph is only useful for the total [17:35:02] and for waiting/altering/updating [17:37:11] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) codfw: rack/setup/install 3 new payments server for frack - https://phabricator.wikimedia.org/T244169 (10Papaul) a:05Papaul→03Jgreen @Jgreen All yours. 
[17:37:56] PROBLEM - PyBal connections to etcd on lvs5003 is CRITICAL: CRITICAL: 0 connections established with conf2003.codfw.wmnet:2379 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [17:38:10] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [17:39:50] sigh... damn icinga [17:40:15] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Ferm rules for labstore1004/1005 NFS hosts - https://phabricator.wikimedia.org/T165136 (10Bstorm) 05Open→03Resolved a:03Bstorm The cluster runs ferm rules now. [17:40:23] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [17:40:25] (03PS1) 10Andrew Bogott: puppetmasters: remove the install-console script [puppet] - 10https://gerrit.wikimedia.org/r/575309 [17:40:28] PROBLEM - Host lvs5003 is DOWN: PING CRITICAL - Packet loss = 100% [17:40:48] ^^ that host is being reimaged and theoretically is downtimed :/ [17:40:54] !log stop and mask all nginx on thumbor* [17:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:00] !log enable puppet on thumbor* [17:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:19] 10Operations, 10Icinga, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Volans) 05Open→03Resolved a:03Volans Resolving as this is an old task and that issue has been fixed, despite we've a similar one right now. 
[17:41:46] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [17:43:22] RECOVERY - Host lvs5003 is UP: PING OK - Packet loss = 0%, RTA = 231.37 ms [17:44:08] (03PS1) 10Andrew Bogott: Add cloudvirt-wdqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/575312 (https://phabricator.wikimedia.org/T221631) [17:45:43] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs5003.eqsin.wmnet'] ` and were **ALL** successful. [17:46:10] (03CR) 10Andrew Bogott: [C: 03+2] Add cloudvirt-wdqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/575312 (https://phabricator.wikimedia.org/T221631) (owner: 10Andrew Bogott) [17:47:19] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [17:48:45] RECOVERY - PyBal connections to etcd on lvs5003 is OK: OK: 16 connections established with conf2003.codfw.wmnet:2379 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [17:49:31] !log delete commonswiki_file_1582685980 from cloudelastic-chi, reindex failed and commonswiki_file_first is still primary [17:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:03] !log resume item migration script at Q50 million T219123 (batch size of 100, 1s sleep) [17:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:09] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [17:52:14] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [17:54:03] (03PS1) 10Andrew Bogott: add host hiera info for cloudvirt-wdqs100[123] [puppet] - 10https://gerrit.wikimedia.org/r/575315 [17:56:27] (03CR) 10Andrew Bogott: [C: 03+2] add 
host hiera info for cloudvirt-wdqs100[123] [puppet] - 10https://gerrit.wikimedia.org/r/575315 (owner: 10Andrew Bogott) [17:57:07] (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Redis is defined in the defaults and to keep it consistent with the other KVs, I overrode only the non-default items. Still add to the i" [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [17:59:34] (03PS8) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) [18:00:04] cscott, arlolra, subbu, halfak, and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1800). [18:00:16] 10Operations, 10ops-eqiad: db1098 power redundancy lost - https://phabricator.wikimedia.org/T246323 (10jcrespo) Please ping me if it is not something as obvious as a cable and need it down to prepare the host. 
[18:03:19] (03PS1) 10Dzahn: fix IP address for apt2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/575318 (https://phabricator.wikimedia.org/T224576) [18:06:25] (03PS2) 10Dzahn: fix IP address for apt2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/575318 (https://phabricator.wikimedia.org/T224576) [18:07:43] PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:08:55] PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:42] RECOVERY - Host cloudvirt-wdqs1003 is UP: PING WARNING - Packet loss = 37%, RTA = 0.26 ms [18:09:52] PROBLEM - configured eth on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:12:59] (03PS1) 10Herron: add profile::idp::client::httpd hiera for elk7 env [puppet] - 10https://gerrit.wikimedia.org/r/575320 (https://phabricator.wikimedia.org/T234854) [18:13:02] RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [18:13:30] PROBLEM - puppet last run on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:13:34] PROBLEM - Check systemd state on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:07] (03PS9) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) [18:14:38] PROBLEM - Host cloudvirt-wdqs1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:15:12] PROBLEM - DPKG on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:15:44] PROBLEM - DPKG on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: 
Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:15:46] PROBLEM - dhclient process on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:15:57] (03CR) 10Dzahn: [C: 03+2] fix IP address for apt2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/575318 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [18:18:45] 10Operations, 10ops-eqiad: db1098 power redundancy lost - https://phabricator.wikimedia.org/T246323 (10Jclark-ctr) 05Open→03Resolved @jcrespo Reseated power cable Psu powered on closing ticket [18:19:46] (03PS5) 10Dzahn: site: add new parsoid nodes with spare role [puppet] - 10https://gerrit.wikimedia.org/r/575100 (https://phabricator.wikimedia.org/T243112) [18:20:02] PROBLEM - configured eth on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:20:13] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10RobH) [18:20:16] !log upload prometheus-mcrouter-exporter 0.1.0+git20200227-1 to stretch-wikimedia [18:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:21] correcting a bug --^ [18:20:42] PROBLEM - Disk space on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt-wdqs1003&var-datasource=eqiad+prometheus/ops [18:21:18] (03PS1) 10Bstorm: toolforge-kubernetes: shut down the old maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/575322 (https://phabricator.wikimedia.org/T214513) [18:21:29] !log END warming wikidata term cache on db1126 for Q6-8 million T219123 (pass1) (will do 2 more passes 
tomorrow) [18:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:34] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [18:22:30] PROBLEM - dhclient process on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:22:48] RECOVERY - IPMI Sensor Status on db1098 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:25:12] PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [18:25:37] could we have some downtimes for these? [18:26:19] mutante: icinga is also having problems with the external command, FYI, I'm troubleshooting [18:27:20] volans: oh..the passive checks from FR? thank you! *nod* [18:27:46] that almost sounded like firewall change [18:27:59] if restarting nsca did not fix it ..yet they are sending packets as normal [18:28:38] yep [18:28:42] it's the command file [18:28:48] some pass some not [18:28:52] same for downtimes [18:28:52] oh [18:29:18] (03PS1) 10Bstorm: toolforge: remove the ancient version of kubectl [puppet] - 10https://gerrit.wikimedia.org/r/575325 (https://phabricator.wikimedia.org/T214513) [18:29:27] both 1001 and 2001, but 2001 stopped 25m ago [18:29:45] both.. that's weird [18:30:30] I'll go with a full rstart, didn't solve the issue before on 2001 but the last restart did [18:31:03] i was about to suggest that.. 
i had vague memories of a similar thing and that fixed it ..yea [18:31:15] !log restarting icinga on icinga1001, command file randomly discarding commands [18:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:32] RECOVERY - Host cloudvirt-wdqs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [18:33:44] PROBLEM - MegaRAID on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:33:51] have to waid now [18:33:54] *wait [18:34:03] ok [18:34:05] (03CR) 10Herron: "puppet is currently broken on the elk7 collectors because this hiera is missing, so no diff is displayed, but it LGTM https://puppet-compi" [puppet] - 10https://gerrit.wikimedia.org/r/575320 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [18:34:58] PROBLEM - Check systemd state on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:35:19] mutante: the downtime for now worked [18:35:23] so promising [18:35:42] but need to check logs for awol [18:35:44] PROBLEM - puppet last run on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:35:44] PROBLEM - IPMI Sensor Status on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:37:03] so far recovering, but I'm not happy as we don't have a real root cause, I had tried also the debug log a bit [18:37:06] PROBLEM - Disk space on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt-wdqs1002&var-datasource=eqiad+prometheus/ops [18:37:31] (03PS2) 10Bstorm: toolforge-kubernetes: shut down the old maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/575322 (https://phabricator.wikimedia.org/T214513) [18:37:32] PROBLEM - IPMI Sensor Status on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:37:54] !log milimetric@deploy1001 Started deploy [analytics/refinery@357ff5c]: Refinery using 0.0.115 [18:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:57] gehel: what's up with all those cloudvirt-wdqs? [18:38:04] PROBLEM - Long running screen/tmux on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [18:38:40] PROBLEM - puppet last run on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:39:17] volans: cool, thanks. yea, sucks to not have a root cause but as long as it happens just every few months i guess we can deal with it [18:39:18] PROBLEM - configured eth on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:39:38] sounds like last time indeed [18:39:57] depends which last time [18:40:05] because one of the last times was nsca the issue, and we fixed that [18:40:25] yea.. that was different. 
that's when we had to kill all the nsca processes afair [18:40:57] the one where icinga dropped some commands from the cmdfile ..like now [18:41:03] or did not notice them [18:41:14] and restarting icinga itself fixed it [18:41:42] PROBLEM - dhclient process on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:42:17] (03PS4) 10Matěj Suchánek: Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573969 [18:43:33] !log adding parse2* machines to puppet [18:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:20] PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [18:48:05] !log milimetric@deploy1001 Finished deploy [analytics/refinery@357ff5c]: Refinery using 0.0.115 (duration: 10m 11s) [18:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:14] (03CR) 10Herron: [C: 03+2] add profile::idp::client::httpd hiera for elk7 env [puppet] - 10https://gerrit.wikimedia.org/r/575320 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [18:49:22] (03PS10) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) [18:49:40] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053 (10greg) Approved for all 3 from my end. [18:49:41] (03CR) 10Greg Grossmeier: [C: 03+1] "Approved." 
[puppet] - 10https://gerrit.wikimedia.org/r/575101 (https://phabricator.wikimedia.org/T246053) (owner: 10Dzahn) [18:49:52] PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [18:51:00] !log upgrade prometheus-mcrouter-exporter to 0.1.0+git20200227-1 on hosts [18:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:30] PROBLEM - Host cloudvirt-wdqs1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:54:12] (03CR) 10Herron: "some comments inline and updated pcc https://puppet-compiler.wmflabs.org/compiler1003/21137/" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [18:55:12] RECOVERY - Host cloudvirt-wdqs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:55:34] PROBLEM - MegaRAID on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:58:25] (03PS1) 10Dzahn: installserver: add apt2001 to fail over servers for APT repo sync [puppet] - 10https://gerrit.wikimedia.org/r/575327 [18:59:59] (03CR) 10Bstorm: "The nature of the timer::job type requires all that mess to be in there even though this is just an ensure => absent" [puppet] - 10https://gerrit.wikimedia.org/r/575322 (https://phabricator.wikimedia.org/T214513) (owner: 10Bstorm) [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1900). [19:00:04] tgr: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:25] o/ [19:00:46] PROBLEM - Long running screen/tmux on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [19:01:53] PROBLEM - Host cloudvirt-wdqs1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:02:46] PROBLEM - MegaRAID on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:03:44] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10chasemp) @HMarcus @MoritzMuehlenhoff Can we all agree on 6 weeks notice to SRE before going live as a control here? If so I think that closes... [19:04:58] RECOVERY - Host cloudvirt-wdqs1001 is UP: PING WARNING - Packet loss = 93%, RTA = 0.28 ms [19:05:04] * Krinkle takes mwdebug1001 for performance testing [19:05:06] I can self-SWAT [19:05:26] * Krinkle waits for tgr [19:05:28] ok :) [19:05:47] Krinkle: will it interfere? I can use 1002 [19:06:09] tgr: a scap sync will override my local changes so yeah I'll wait [19:06:34] PROBLEM - Host cloudvirt-wdqs1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:06:36] (03PS2) 10Gergő Tisza: Enable articletopic: search keyword in CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574634 (https://phabricator.wikimedia.org/T240559) [19:06:41] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . 
[19:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:11] (03CR) 10Gergő Tisza: [C: 03+2] Enable articletopic: search keyword in CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574634 (https://phabricator.wikimedia.org/T240559) (owner: 10Gergő Tisza) [19:07:13] (03PS1) 10Effie Mouzeli: hieradata: send mw1262's apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/575329 (https://phabricator.wikimedia.org/T244472) [19:07:23] volans: I'm late, but those should be new servers in WMCS that will be dedicated to wdqs testing. [19:08:02] Atm they are just the virtualization hosts, nothing wdqs specific there yet [19:08:42] (03Merged) 10jenkins-bot: Enable articletopic: search keyword in CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574634 (https://phabricator.wikimedia.org/T240559) (owner: 10Gergő Tisza) [19:10:41] (03CR) 10Herron: [C: 03+1] hieradata: send mw1262's apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/575329 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [19:12:29] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1002/21138/mw1262.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/575329 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [19:12:33] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: send mw1262's apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/575329 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [19:12:57] !log milimetric@deploy1001 Started deploy [analytics/refinery@357ff5c] (thin): Refinery using 0.0.115 [19:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:04] !log milimetric@deploy1001 Finished deploy [analytics/refinery@357ff5c] (thin): Refinery using 0.0.115 (duration: 00m 07s) [19:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:54] 
10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Degraded RAID on analytics1044 - https://phabricator.wikimedia.org/T245910 (10Nuria) 05Open→03Resolved [19:14:20] PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:22] PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:24] (03PS1) 10Krinkle: [DNM] Test LCStoreArray on mwdebug1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575331 [19:15:28] RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [19:16:46] (03CR) 10Nuria: Make normalized request count available in Turnilo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575035 (https://phabricator.wikimedia.org/T241162) (owner: 10Milimetric) [19:16:48] RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:16:49] (03CR) 10Elukey: [C: 03+2] Move all Report Updater Jobs to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [19:17:30] !log ganeti2001 - removing VM apt2001 to re-create it after IP change [19:17:31] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [19:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:42] !log depool mw1262 [19:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:52] RECOVERY - Host cloudvirt-wdqs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:17:57] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' . 
[19:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:18] (03PS2) 10Dzahn: admins: add tchanders, dmaza and wikigit to deployers [puppet] - 10https://gerrit.wikimedia.org/r/575101 (https://phabricator.wikimedia.org/T246053) [19:20:32] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574634|Enable articletopic: search keyword in CirrusSearch (T240559)]] (duration: 01m 05s) [19:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:37] T240559: Expose ORES drafttopic data in ElasticSearch via a custom CirrusSearch keyword - https://phabricator.wikimedia.org/T240559 [19:20:44] PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:21:40] PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:21:45] (03CR) 10Dzahn: [C: 03+2] admins: add tchanders, dmaza and wikigit to deployers [puppet] - 10https://gerrit.wikimedia.org/r/575101 (https://phabricator.wikimedia.org/T246053) (owner: 10Dzahn) [19:21:57] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: once more for good measure (duration: 01m 03s) [19:22:00] RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [19:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:11] (03CR) 10Volans: [C: 03+1] "LGTM, the existing file will need a manual cleanup on those 7 hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/575309 (owner: 10Andrew Bogott) [19:22:31] Krinkle: all yours [19:22:42] tgr: thanks [19:23:23] (03CR) 10Andrew Bogott: [C: 03+2] puppetmasters: remove the install-console script [puppet] - 10https://gerrit.wikimedia.org/r/575309 (owner: 10Andrew Bogott) [19:24:26] PROBLEM - configured eth on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [19:25:29] (03CR) 10Andrew 
Bogott: [C: 03+2] "cleanup done" [puppet] - 10https://gerrit.wikimedia.org/r/575309 (owner: 10Andrew Bogott) [19:26:06] PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:26:21] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [19:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:50] PROBLEM - dhclient process on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [19:27:40] RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:28:05] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 5 others: Public schema.wikimedia.org endpoint for schema.svc - https://phabricator.wikimedia.org/T233630 (10Nuria) 05Open→03Resolved [19:28:32] RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:29:42] PROBLEM - Check systemd state on cloudvirt-wdqs1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.44: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:29:56] PROBLEM - DPKG on cloudvirt-wdqs1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.44: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:30:07] (03PS11) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) [19:30:08] PROBLEM - Disk space on cloudvirt-wdqs1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.44: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt-wdqs1001&var-datasource=eqiad+prometheus/ops [19:30:46] PROBLEM - puppet last run on cloudvirt-wdqs1001 is CRITICAL: 
CHECK_NRPE: Error - Could not connect to 10.64.20.44: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:31:26] PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:32:26] PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:04] RECOVERY - configured eth on cloudvirt-wdqs1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [19:33:18] RECOVERY - dhclient process on cloudvirt-wdqs1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [19:34:08] RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [19:34:40] (03PS3) 10Dzahn: admins: add tchanders, dmaza and wikigit to deployers [puppet] - 10https://gerrit.wikimedia.org/r/575101 (https://phabricator.wikimedia.org/T246053) [19:35:00] RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:35:50] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={rsyslog-info,rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d [19:35:50] consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [19:36:26] RECOVERY - DPKG on cloudvirt-wdqs1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:36:36] RECOVERY - Disk space on cloudvirt-wdqs1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt-wdqs1001&var-datasource=eqiad+prometheus/ops [19:36:54] 
RECOVERY - puppet last run on cloudvirt-wdqs1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:37:14] marostegui: want to deploy https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/525147/ today? [19:38:18] RECOVERY - Check systemd state on cloudvirt-wdqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:38:42] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [19:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:11] (03PS1) 10Jforrester: Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) [19:43:32] PROBLEM - configured eth on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [19:43:50] PROBLEM - Check systemd state on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:44:04] PROBLEM - dhclient process on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [19:44:06] PROBLEM - Disk space on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt-wdqs1003&var-datasource=eqiad+prometheus/ops [19:44:26] PROBLEM - DPKG on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:45:04] PROBLEM - SSH on 
cloudvirt-wdqs1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:45:48] PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:46:44] RECOVERY - Check the NTP synchronisation status of timesyncd on cloudvirt-wdqs1001 is OK: OK: synced at Thu 2020-02-27 19:46:43 UTC. https://wikitech.wikimedia.org/wiki/NTP [19:46:48] !log Welcome new deployers Thalia Chan, Moriel Schottlender and Dayllan Maza (Anti-Harrassment-Tools team) [19:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:46] PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:49:05] (03Abandoned) 10Krinkle: [DNM] Test LCStoreArray on mwdebug1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575331 (owner: 10Krinkle) [19:49:18] RECOVERY - SSH on cloudvirt-wdqs1003 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:49:22] RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:49:25] * Krinkle is done testing on mwdebug1001 [19:50:12] RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:50:19] 10Operations, 10Parsoid-PHP, 10SRE-Access-Requests, 10serviceops, 10Patch-For-Review: Give all members of the Parsing team production `deployment` access - https://phabricator.wikimedia.org/T245877 (10greg) Approved from my end. 
[19:52:31] (03PS1) 10BBlack: Revert "admin: add Brandon's temporary key" [puppet] - 10https://gerrit.wikimedia.org/r/575340 [19:54:08] (03PS1) 10BBlack: Revert "new key for bblack" [homer/public] - 10https://gerrit.wikimedia.org/r/575341 [19:54:12] PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:12] PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:13] (03PS2) 10BBlack: Revert "new key for bblack" [homer/public] - 10https://gerrit.wikimedia.org/r/575341 [19:55:12] RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:56:31] (03PS1) 10Ayounsi: Add Prometheus exporter for Squid [puppet] - 10https://gerrit.wikimedia.org/r/575342 (https://phabricator.wikimedia.org/T245176) [19:56:46] RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [19:57:09] (03PS1) 10Ottomata: eventgate-logging-external - bump image version to 2020-02-25-183224-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/575343 (https://phabricator.wikimedia.org/T226986) [19:58:21] (03PS1) 10Effie Mouzeli: logstash: switch NOSPACE to DATA on apache grok filter [puppet] - 10https://gerrit.wikimedia.org/r/575344 [19:59:43] (03CR) 10Ottomata: [C: 03+2] eventgate-logging-external - bump image version to 2020-02-25-183224-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/575343 (https://phabricator.wikimedia.org/T226986) (owner: 10Ottomata) [20:00:04] longma and twentyafterfour: How many deployers does it take to do Mediawiki train - American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T2000). [20:00:16] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [20:00:16] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . 
[20:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:40] PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:00:40] PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:01:19] (03CR) 10BBlack: [C: 03+2] Revert "admin: add Brandon's temporary key" [puppet] - 10https://gerrit.wikimedia.org/r/575340 (owner: 10BBlack) [20:02:13] (03PS1) 10Jeena Huneidi: all wikis to 1.35.0-wmf.21 refs T233869 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575347 [20:02:15] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.35.0-wmf.21 refs T233869 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575347 (owner: 10Jeena Huneidi) [20:02:22] RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:02:22] RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [20:02:30] PROBLEM - MegaRAID on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:04:10] (03PS3) 10Holger Knust: Added new chart for cpjobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) [20:04:24] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.21 refs T233869 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575347 (owner: 10Jeena Huneidi) [20:05:37] (03CR) 10Herron: [C: 03+1] logstash: switch NOSPACE to DATA on apache grok filter [puppet] - 10https://gerrit.wikimedia.org/r/575344 (owner: 10Effie Mouzeli) [20:05:57] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.21 refs T233869 [20:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:02] T233869: 1.35.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T233869 
[20:07:02] PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:06] PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:18] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [20:07:18] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . [20:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:11] (03CR) 10Effie Mouzeli: [C: 03+2] logstash: switch NOSPACE to DATA on apache grok filter [puppet] - 10https://gerrit.wikimedia.org/r/575344 (owner: 10Effie Mouzeli) [20:09:16] (03PS4) 10Holger Knust: Added new chart for cpjobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) [20:09:18] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053 (10Dzahn) Hey all, the code change to add your SSH users has been merged. Puppet ran on the bastion hosts and deploy1001. Here are some docs... [20:10:27] 10Operations, 10SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053 (10Dzahn) 05Open→03Resolved a:03Dzahn ` [deploy1001:~] $ id dmaza uid=17497(dmaza) gid=500(wikidev) groups=500(wikidev),705(deployment) [deploy1001:~] $ id w... 
[20:10:43] (03PS1) 10Ottomata: eventgate-logging-external - fix mediawiki/client/error schema title [deployment-charts] - 10https://gerrit.wikimedia.org/r/575348 (https://phabricator.wikimedia.org/T226986) [20:12:07] (03PS2) 10Jforrester: Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) [20:12:27] (03CR) 10Ottomata: [C: 03+2] eventgate-logging-external - fix mediawiki/client/error schema title [deployment-charts] - 10https://gerrit.wikimedia.org/r/575348 (https://phabricator.wikimedia.org/T226986) (owner: 10Ottomata) [20:13:57] longma: Looks quiet to me. [20:14:07] agreed [20:14:07] !log pool mw1262 [20:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:33] (03CR) 10BBlack: [C: 03+2] Revert "new key for bblack" [homer/public] - 10https://gerrit.wikimedia.org/r/575341 (owner: 10BBlack) [20:14:50] (03Merged) 10jenkins-bot: Revert "new key for bblack" [homer/public] - 10https://gerrit.wikimedia.org/r/575341 (owner: 10BBlack) [20:15:58] (03CR) 10Jdlrobson: "I think this can land now?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [20:16:43] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [20:16:43] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . [20:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:42] RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:21:37] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . 
[20:21:37] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . [20:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:55] (03PS9) 10Jforrester: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 [20:22:04] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [20:22:04] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . [20:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:11] (03CR) 10Jforrester: [C: 04-1] "> Patch Set 8:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [20:22:17] (03CR) 10Jforrester: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [20:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:04] PROBLEM - puppet last run on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:24:36] (03PS1) 10Bstorm: cloudstore: Add cloudbackup servers to the ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/575351 [20:24:52] RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [20:26:01] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:26:04] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:48] (03CR) 10Ayounsi: "Still have to run PCC, 
but this role is still WIP." [puppet] - 10https://gerrit.wikimedia.org/r/575342 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [20:28:58] PROBLEM - configured eth on cloudvirt-wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:29:16] PROBLEM - Check systemd state on cloudvirt-wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:29:28] PROBLEM - dhclient process on cloudvirt-wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [20:29:34] PROBLEM - Disk space on cloudvirt-wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt-wdqs1003&var-datasource=eqiad+prometheus/ops [20:29:52] PROBLEM - DPKG on cloudvirt-wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:30:33] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:13] (03PS2) 10Bstorm: cloudstore: Add cloudbackup servers to the ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/575351 [20:32:32] (03CR) 10Andrew Bogott: [C: 03+1] cloudstore: Add cloudbackup servers to the ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/575351 (owner: 10Bstorm) [20:32:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:32:59] longma: OK for me to do a deploy? 
[20:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:20] RoanKattouw: Hi. Would it be possible to take a look at T244617? Thanks [20:34:20] T244617: Please clear two stuck notifications for MABot - https://phabricator.wikimedia.org/T244617 [20:34:35] James_F: yeah go ahead [20:34:42] (03CR) 10Bstorm: [C: 03+2] cloudstore: Add cloudbackup servers to the ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/575351 (owner: 10Bstorm) [20:34:50] Excellent. [20:34:56] (03CR) 10Jforrester: [C: 03+2] Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [20:36:21] (03Merged) 10jenkins-bot: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 (owner: 10Jforrester) [20:50:58] AaronSchulz: I'm off today, sorry, let's try next week! [20:53:40] RECOVERY - Check the NTP synchronisation status of timesyncd on cloudvirt-wdqs1003 is OK: OK: synced at Thu 2020-02-27 20:53:39 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [20:53:59] (03PS1) 10Jforrester: wgLogos: Explicitly set 'wordmark' for all Wikipedias which over-ride [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575352 [20:56:58] (03PS2) 10Jforrester: wgLogos: Explicitly set 'wordmark' for all Wikipedias which over-ride [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575352 [20:58:41] (03CR) 10Jforrester: [C: 03+2] wgLogos: Explicitly set 'wordmark' for all Wikipedias which over-ride [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575352 (owner: 10Jforrester) [20:59:42] (03Merged) 10jenkins-bot: wgLogos: Explicitly set 'wordmark' for all Wikipedias which over-ride [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575352 (owner: 10Jforrester) [21:02:46] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Merge wgMinervaCustomLogos into wgLogos (duration: 00m 57s) [21:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:09] 10Operations, 10MediaWiki-General, 10observability, 10serviceops: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10colewhite) One alternative is to adopt a sidecar in the form of statsd_exporter and have it do the heavy lifting of translating MediaWiki and MW Extension metrics in... [21:04:09] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 00m 56s) [21:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:21] !log jforrester@deploy1001 Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org) [21:07:28] Oh dear. [21:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:16] (03PS1) 10Volans: netbox: fine tune log and exception messages [software/spicerack] - 10https://gerrit.wikimedia.org/r/575353 [21:10:00] * James_F pokes. 
[21:10:26] PROBLEM - Apache HTTP on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:10:26] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:10:26] PROBLEM - PHP7 rendering on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:26] PROBLEM - PHP7 rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:34] PROBLEM - Apache HTTP on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:10:40] PROBLEM - Apache HTTP on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:10:42] PROBLEM - Nginx local proxy to apache on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:10:45] Yeah, sorry, this is me. Fixing now. 
[21:10:46] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1985 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:10:46] PROBLEM - PHP7 rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:10:52] Out-of-sequence deploy. [21:10:54] PROBLEM - Apache HTTP on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:10:56] !log jforrester@deploy1001 Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org) [21:11:04] PROBLEM - Apache HTTP on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:11:04] PROBLEM - PHP7 rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:04] PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:11:04] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:11:04] PROBLEM - Apache HTTP on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:11:04] PROBLEM - PHP7 rendering on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 
1980 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:20] PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:11:20] PROBLEM - Nginx local proxy to apache on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:11:22] (Canaries are the ones upset.) [21:11:24] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:11:26] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1985 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:11:41] !log jforrester@deploy1001 Synchronized multiversion/MWWikiversions.php: Drop references to four dblists (duration: 00m 35s) [21:11:42] PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:44] PROBLEM - Apache HTTP on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:11:54] PROBLEM - PHP7 rendering on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.015 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:11:57] !log jforrester@deploy1001 sync-file aborted: Drop references to four dblists (duration: 00m 05s) [21:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:08] PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1985 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:12:08] PROBLEM - Apache HTTP on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:12:08] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:12:24] PROBLEM - PHP7 rendering on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:12:44] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:12:46] RECOVERY - Nginx local proxy to apache on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:12:48] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 631 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:12:50] RECOVERY - PHP7 rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.212 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:12:58] RECOVERY - Apache HTTP 
on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:13:05] Sorry for the noise. [21:13:08] RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:13:08] RECOVERY - Nginx local proxy to apache on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:13:08] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:13:08] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:13:08] RECOVERY - PHP7 rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:13:08] RECOVERY - PHP7 rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:13:21] !log jforrester@deploy1001 Synchronized dblists/: Add back the deleted dblists to make the canaries quiet (duration: 00m 56s) [21:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:24] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:13:24] RECOVERY - Nginx local proxy to apache on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers 
[21:13:28] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:13:30] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 632 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:13:38] !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@c70b338]: AQS: Minor fix [21:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:48] RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:13:50] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:14:00] RECOVERY - PHP7 rendering on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 73003 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:14:16] RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 73014 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:14:16] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:14:16] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:14:32] RECOVERY - PHP7 rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.125 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:14:38] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:14:38] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:14:38] RECOVERY - PHP7 rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:14:38] RECOVERY - PHP7 rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:14:44] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:14:45] !log jforrester@deploy1001 Synchronized multiversion/MWWikiversions.php: Drop references to four dblists to canaries too (duration: 00m 55s) [21:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:08] !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@c70b338]: AQS: Minor fix (duration: 02m 30s) [21:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:08] !log jforrester@deploy1001 Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org) [21:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:51] !log jforrester@deploy1001 Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org) [21:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:20] PROBLEM - Apache HTTP on mw1262 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:22:20] PROBLEM - Nginx local proxy to apache on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:22:26] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1985 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:22:28] PROBLEM - PHP7 rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:22:34] PROBLEM - Apache HTTP on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:22:39] !log jforrester@deploy1001 Scap failed!: 8/11 canaries failed their endpoint checks(http://en.wikipedia.org) [21:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:44] PROBLEM - Apache HTTP on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:22:44] PROBLEM - Apache HTTP on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:22:44] PROBLEM - PHP7 rendering on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:22:44] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 
0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:22:44] PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:22:44] PROBLEM - PHP7 rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:22:45] Eurgh. [21:23:41] I hate scap with the passion of a thousand suns. [21:24:12] !log jforrester@deploy1001 Synchronized dblists/: Again, this time without blanked files (duration: 00m 56s) [21:24:12] jforrester@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [21:24:14] (03PS2) 10Aaron Schulz: Set "coalesceKeys" in mc.php to minimize host fan-out by WANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575098 [21:24:32] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:24:32] RECOVERY - Nginx local proxy to apache on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:24:36] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 631 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:24:38] RECOVERY - PHP7 rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:24:44] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:24:54] 
RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:24:54] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:24:54] RECOVERY - Nginx local proxy to apache on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:24:54] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:24:54] RECOVERY - PHP7 rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:24:54] RECOVERY - PHP7 rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:24:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:25:21] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Touch the dblists list (duration: 00m 56s) [21:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:29:48] 10Operations, 10netbox: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10crusnov) Some notes from conversations about this: - https://gerrit.wikimedia.org/r/c/operations/puppet/+/571486 is an example of CAS setup . - We are in general agreement as to using apache to query CAS and the... [21:30:42] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f4ae4e0d390: Failed to establish a new connection: [Errno 111] Connection [21:30:42] ://wikitech.wikimedia.org/wiki/Search%23Administration [21:31:38] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:31:46] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:31:48] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:32:58] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:33:28] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:35:04] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: relocating_shards: 0, unassigned_shards: 374, number_of_nodes: 6, cluster_name: production-logstash-eqiad, number_of_data_nodes: 3, active_shards: 750, task_max_waiting_in_queue_millis: 59399, active_shards_percent_as_number: 66.72597864768683, initializing_shards: 0, timed_out: False, status: yello [21:35:04] light_fetch: 1122, number_of_pending_tasks: 52, delayed_unassigned_shards: 0, active_primary_shards: 484 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:35:06] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: initializing_shards: 0, timed_out: False, task_max_waiting_in_queue_millis: 56570, number_of_in_flight_fetch: 1122, active_primary_shards: 484, active_shards: 750, active_shards_percent_as_number: 66.72597864768683, delayed_unassigned_shards: 0, number_of_nodes: 6, cluster_name: production-logstash- [21:35:06] pending_tasks: 36, number_of_data_nodes: 3, unassigned_shards: 374, status: yellow, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:35:17] (03PS1) 10Jforrester: 
Revert "Merge wgMinervaCustomLogos into wgLogos" and follow-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575356 [21:35:34] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_nodes: 6, initializing_shards: 0, number_of_pending_tasks: 36, timed_out: False, number_of_data_nodes: 3, active_primary_shards: 484, active_shards: 750, status: yellow, cluster_name: production-logstash-eqiad, relocating_shards: 0, number_of_in_flight_fetch: 1122, active_shards_percent_as [21:35:34] 864768683, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 85255, unassigned_shards: 374 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:35:54] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: timed_out: False, initializing_shards: 0, relocating_shards: 0, unassigned_shards: 374, active_shards_percent_as_number: 66.72597864768683, cluster_name: production-logstash-eqiad, task_max_waiting_in_queue_millis: 105976, active_primary_shards: 484, delayed_unassigned_shards: 0, status: yellow, num [21:35:54] number_of_pending_tasks: 33, number_of_data_nodes: 3, number_of_in_flight_fetch: 1122, active_shards: 750 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:36:00] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: task_max_waiting_in_queue_millis: 112780, number_of_data_nodes: 3, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 1122, number_of_pending_tasks: 33, status: yellow, active_primary_shards: 484, active_shards_percent_as_number: 66.72597864768683, initializing_shards: 0, timed_out: False, rel [21:36:01] , active_shards: 750, cluster_name: production-logstash-eqiad, unassigned_shards: 374, number_of_nodes: 6 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:37:06] 
(03CR) 10Jforrester: [C: 03+2] "Suspect flakiness." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575356 (owner: 10Jforrester) [21:38:13] (03Merged) 10jenkins-bot: Revert "Merge wgMinervaCustomLogos into wgLogos" and follow-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575356 (owner: 10Jforrester) [21:39:14] !log jforrester@deploy1001 scap failed: average error rate on 11/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [21:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:29] Forcing. [21:40:07] !log jforrester@deploy1001 Synchronized dblists/: Re-establish dblists everywhere (duration: 00m 33s) [21:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:40] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 1051 threshold =0.34 breach: task_max_waiting_in_queue_millis: 81936, delayed_unassigned_shards: 0, number_of_nodes: 5, active_shards_percent_as_number: 6.494661921708185, timed_out: False, status: red, number_of_in_flight_fetch: 0, unassigned_shards: 1043, number_of_pending_tasks: 76, initializing_shards: [21:41:40] a_nodes: 2, active_primary_shards: 65, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 73 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:41:40] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 1051 threshold =0.34 breach: status: red, cluster_name: production-logstash-eqiad, unassigned_shards: 1043, relocating_shards: 0, task_max_waiting_in_queue_millis: 83481, active_shards: 73, number_of_data_nodes: 2, number_of_in_flight_fetch: 0, active_shards_percent_as_number: 6.494661921708185, number_of_ [21:41:40] of_pending_tasks: 78, active_primary_shards: 65, 
initializing_shards: 8, delayed_unassigned_shards: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration [21:42:06] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Use the four dblists again (duration: 00m 33s) [21:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:10] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 994 threshold =0.34 breach: number_of_nodes: 5, status: red, number_of_pending_tasks: 78, initializing_shards: 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 112472, delayed_unassigned_shards: 0, active_shards: 130, unassigned_shards: 986, relocating_shards: 0, number_of_data_nodes: 2, [21:42:10] cent_as_number: 11.565836298932384, timed_out: False, cluster_name: production-logstash-eqiad, active_primary_shards: 117 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:42:32] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 958 threshold =0.34 breach: active_shards: 166, cluster_name: production-logstash-eqiad, number_of_in_flight_fetch: 0, initializing_shards: 8, relocating_shards: 0, unassigned_shards: 950, number_of_data_nodes: 2, number_of_nodes: 5, active_primary_shards: 153, timed_out: False, task_max_waiting_in_queue_m [21:42:32] tive_shards_percent_as_number: 14.768683274021353, status: red, delayed_unassigned_shards: 0, number_of_pending_tasks: 75 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:42:38] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 947 threshold =0.34 breach: unassigned_shards: 939, timed_out: False, active_shards_percent_as_number: 15.747330960854091, task_max_waiting_in_queue_millis: 141613, relocating_shards: 0, delayed_unassigned_shards: 0, cluster_name: 
production-logstash-eqiad, status: red, number_of_pending_tasks: 111, number [21:42:38] ive_primary_shards: 164, number_of_in_flight_fetch: 0, active_shards: 177, initializing_shards: 8, number_of_data_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:43:45] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Roll back to setting wgMinervaCustomLogos (duration: 00m 33s) [21:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:03] (03CR) 10C. Scott Ananian: "Largely similar to I892ece88dd56af2758712b0960a62be7a4370715 but LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) (owner: 10Jforrester) [21:44:36] OK, prod clean and I'm stopping for a bit. [21:44:43] * James_F sighs at ES. [21:46:10] (03CR) 10C. Scott Ananian: [C: 03+1] Parsoid: Use the version of Parsoid in $IP/vendor (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) (owner: 10Jforrester) [21:46:16] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:46:16] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1007.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:46:24] PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:46:34] (03Abandoned) 10C. Scott Ananian: Load Parsoid from the vendor repo, not from an ad-hoc deploy dir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572051 (owner: 10C. 
Scott Ananian) [21:46:38] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:46:54] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:47:04] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:47:12] PROBLEM - logstash process on logstash2005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (logstash), command name java, args logstash https://wikitech.wikimedia.org/wiki/Logstash [21:47:58] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:58] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1007.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:48:04] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:48:04] PROBLEM - logstash process on logstash1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (logstash), command name java, args logstash https://wikitech.wikimedia.org/wiki/Logstash [21:48:08] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:48:08] PROBLEM - logstash process on logstash2006 is CRITICAL: PROCS CRITICAL: 0 processes with 
UID = 498 (logstash), command name java, args logstash https://wikitech.wikimedia.org/wiki/Logstash [21:48:14] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_nodes: 5, number_of_pending_tasks: 0, active_shards: 744, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, timed_out: False, active_primary_shards: 484, number_of_data_nodes: 2, unassigned_shards: 372, cluster_name: production-logstash-eqiad, number_of_in_flight_fetch: 0, initial [21:48:14] active_shards_percent_as_number: 66.19217081850533, delayed_unassigned_shards: 0, status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration [21:48:16] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 5, initializing_shards: 8, task_max_waiting_in_queue_millis: 0, number_of_data_nodes: 2, active_primary_shards: 484, relocating_shards: 0, cluster_name: production-logstash-eqiad, number_of_pending_tasks: 0, timed_out: False, active_shards: 744, active_shards_percent [21:48:16] 217081850533, number_of_in_flight_fetch: 0, delayed_unassigned_shards: 0, unassigned_shards: 372 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:48:36] RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [21:48:44] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards: 746, number_of_nodes: 5, relocating_shards: 0, active_primary_shards: 484, cluster_name: production-logstash-eqiad, number_of_data_nodes: 2, number_of_pending_tasks: 0, timed_out: False, delayed_unassigned_shards: 0, u [21:48:44] 370, active_shards_percent_as_number: 
66.37010676156584, initializing_shards: 8, status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration [21:48:50] !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@c70b338]: AQS: Minor fix take 2 [21:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:06] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: relocating_shards: 0, status: yellow, initializing_shards: 8, number_of_data_nodes: 2, active_primary_shards: 484, task_max_waiting_in_queue_millis: 0, active_shards: 747, timed_out: False, number_of_nodes: 5, cluster_name: production-logstash-eqiad, unassigned_shards: 369, active_shards_percent_as_ [21:49:06] 73309609, number_of_in_flight_fetch: 0, number_of_pending_tasks: 0, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:49:14] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_pending_tasks: 0, status: yellow, number_of_nodes: 5, number_of_data_nodes: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, active_primary_shards: 484, unassigned_shards: 369, active_shards: 747, cluster_name: production-logstash-eqiad, delayed_u [21:49:14] 0, timed_out: False, initializing_shards: 8, active_shards_percent_as_number: 66.45907473309609 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:49:16] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [21:49:26] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [21:49:33] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): (Need by: 2020-03-02) 
rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Andrew) [21:50:03] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): (Need by: 2020-03-02) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Andrew) 05Open→03Resolved I have an OS installed on all three of these hosts and I'm experimenting on them in the cloud-v... [21:50:06] 10Operations, 10DC-Ops, 10hardware-requests: eqiad: three clouvirt-wdqs servers for WDQS testing - https://phabricator.wikimedia.org/T232654 (10Andrew) [21:50:24] !log start elasticsearch on logstash1010 [21:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:49] !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@c70b338]: AQS: Minor fix take 2 (duration: 02m 59s) [21:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:14] PROBLEM - Check systemd state on logstash1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:24] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:52:28] RECOVERY - logstash process on logstash1009 is OK: PROCS OK: 1 process with UID = 498 (logstash), command name java, args logstash https://wikitech.wikimedia.org/wiki/Logstash [21:52:31] (03PS1) 10Herron: Revert "hieradata: send mw1262's apache logs to logstash" [puppet] - 10https://gerrit.wikimedia.org/r/575358 [21:52:43] herron: better depool it [21:52:54] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:52:54] reverting it will not delete the file [21:52:57] I'll do it [21:53:14] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [21:53:31] effie: ok thanks [21:53:36] !log depool mw1262, suspecting it might have overloaded logstash [21:53:38] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1010 is OK: OK - elasticsearch status production-logstash-eqiad: delayed_unassigned_shards: 0, active_shards: 748, number_of_data_nodes: 3, status: yellow, cluster_name: production-logstash-eqiad, number_of_in_flight_fetch: 0, initializing_shards: 5, unassigned_shards: 371, active_primary_shards: 484, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, num [21:53:38] timed_out: False, active_shards_percent_as_number: 66.54804270462633, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:40] (03Abandoned) 10Herron: Revert "hieradata: send mw1262's apache logs to logstash" [puppet] - 10https://gerrit.wikimedia.org/r/575358 (owner: 10Herron) [21:53:50] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: 
TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [21:54:34] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:23] !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@5a67e6e]: AQS: Minor fix take 3 [21:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:40] (03PS1) 10Jforrester: tests: Assert the 'wordmark' config set-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575361 [22:01:43] (03PS1) 10Jforrester: Only try to set wgLogos['wordmark'] if not already done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575362 [22:01:45] (03PS1) 10Jforrester: Re-try "Merge wgMinervaCustomLogos into wgLogos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575363 [22:01:47] (03PS1) 10Jforrester: Stop setting wgLogos['wordmark'] based on wgMinervaCustomLogos, never set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575364 [22:01:49] (03PS1) 10Jforrester: Stop loading 'wikipedia-english', 'wikipedia-e-acute', 'wikipedia-cyrillic', 'wikipedia-devanagari' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575365 [22:01:51] (03PS1) 10Jforrester: Stop defining 'wikipedia-english', 'wikipedia-e-acute', 'wikipedia-cyrillic', 'wikipedia-devanagari' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575366 [22:03:26] RECOVERY - logstash process on logstash2006 is OK: PROCS OK: 1 process with UID = 498 (logstash), command name java, args logstash https://wikitech.wikimedia.org/wiki/Logstash [22:03:48] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [22:05:38] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 
https://wikitech.wikimedia.org/wiki/Logstash [22:06:47] !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@5a67e6e]: AQS: Minor fix take 3 (duration: 07m 24s) [22:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:29] (03PS2) 10Jforrester: Re-try "Merge wgMinervaCustomLogos into wgLogos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575363 [22:07:31] (03PS2) 10Jforrester: Stop setting wgLogos['wordmark'] based on wgMinervaCustomLogos, never set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575364 [22:07:33] (03PS2) 10Jforrester: Stop loading 'wikipedia-english', 'wikipedia-e-acute', 'wikipedia-cyrillic', 'wikipedia-devanagari' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575365 [22:07:35] (03PS2) 10Jforrester: Stop defining 'wikipedia-english', 'wikipedia-e-acute', 'wikipedia-cyrillic', 'wikipedia-devanagari' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575366 [22:08:03] (03CR) 10Jforrester: [C: 03+2] tests: Assert the 'wordmark' config set-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575361 (owner: 10Jforrester) [22:08:48] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [22:09:08] RECOVERY - logstash process on logstash2005 is OK: PROCS OK: 1 process with UID = 498 (logstash), command name java, args logstash https://wikitech.wikimedia.org/wiki/Logstash [22:09:56] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [22:10:12] (03Merged) 10jenkins-bot: tests: Assert the 'wordmark' config set-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575361 (owner: 10Jforrester) [22:21:13] 10Operations, 10ops-eqiad, 10DC-Ops: audit/rebalance power in a5-eqiad - https://phabricator.wikimedia.org/T245655 (10ayounsi) I disabled alerting for that 
host as it has been alerting/flapping regularly. To be turned back on when fixed: https://librenms.wikimedia.org/device/device=41/tab=edit/ [22:39:04] (03CR) 10C. Scott Ananian: [C: 03+1] Parsoid: Use the version of Parsoid in $IP/vendor (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) (owner: 10Jforrester) [22:41:13] (03CR) 10Subramanya Sastry: Parsoid: Use the version of Parsoid in $IP/vendor (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) (owner: 10Jforrester) [22:49:33] !log Manually `scap pull`ed on mw1349 and mw1351 as they were emitting odd errors. [22:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:17] (03CR) 10Jforrester: [C: 03+2] Only try to set wgLogos['wordmark'] if not already done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575362 (owner: 10Jforrester) [22:59:35] (03Merged) 10jenkins-bot: Only try to set wgLogos['wordmark'] if not already done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575362 (owner: 10Jforrester) [23:01:05] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Only try to set wgLogos['wordmark'] if not already done (duration: 00m 58s) [23:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:26] (03CR) 10Jforrester: [C: 03+2] Re-try "Merge wgMinervaCustomLogos into wgLogos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575363 (owner: 10Jforrester) [23:02:27] (03Merged) 10jenkins-bot: Re-try "Merge wgMinervaCustomLogos into wgLogos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575363 (owner: 10Jforrester) [23:04:55] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Merge wgMinervaCustomLogos into wgLogos, take 2 (duration: 00m 56s) [23:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:45] (03CR) 
10Jforrester: [C: 03+2] Stop setting wgLogos['wordmark'] based on wgMinervaCustomLogos, never set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575364 (owner: 10Jforrester) [23:06:42] (03Merged) 10jenkins-bot: Stop setting wgLogos['wordmark'] based on wgMinervaCustomLogos, never set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575364 (owner: 10Jforrester) [23:07:30] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 00m 56s) [23:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:19] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Stop setting wgLogos['wordmark'] based on wgMinervaCustomLogos, never set (duration: 00m 56s) [23:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:01] (03PS1) 10Jdlrobson: Drop legacy main page special casing on select projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575376 (https://phabricator.wikimedia.org/T32405) [23:28:12] (03CR) 10jerkins-bot: [V: 04-1] Drop legacy main page special casing on select projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575376 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [23:45:35] (03PS3) 10Jforrester: Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) [23:46:43] (03CR) 10jerkins-bot: [V: 04-1] Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) (owner: 10Jforrester) [23:48:27] (03PS4) 10Jforrester: Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) [23:53:02] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [23:53:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:43] (03Abandoned) 10saper: Wikistats v2: go live [puppet] - 10https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) (owner: 10saper)