[00:00:01] <stashbot>	 T246212: Move wgULSLanguageDetection variable to CommonSettings.php and document it - https://phabricator.wikimedia.org/T246212
[00:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T0000).
[00:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[00:00:24] <James_F>	 (Still deploying.)
[00:01:19] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T246212 Stop setting wgULSLanguageDetection in IS, set in CS (duration: 01m 05s)
[00:01:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:35] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 01m 03s)
[00:02:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:02] <wikibugs>	 (PS8) Jforrester: Merge $wgLogo and $wgLogoHD into $wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140)
[00:05:18] <wikibugs>	 (PS7) Jforrester: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/572998
[00:05:59] <wikibugs>	 Operations, ops-eqiad, serviceops, Patch-For-Review: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (Jclark-ctr) a:Jclark-ctr→Cmjohnson Finished cables handing off to chris for remaining steps name rack_name position switch p...
[00:06:05] <wikibugs>	 (CR) jerkins-bot: [V: -1] Merge $wgLogo and $wgLogoHD into $wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: Jforrester)
[00:06:15] <wikibugs>	 Operations, ops-eqiad, serviceops, Patch-For-Review: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (Jclark-ctr)
[00:06:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/572998 (owner: Jforrester)
[00:06:31] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Volans) >>! In T243112#5922017, @Papaul wrote: > @Volans I am trying the downtime command from cookbook to downtime a host before running the auto-...
[00:06:59] <wikibugs>	 (PS9) Jforrester: Merge $wgLogo and $wgLogoHD into $wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140)
[00:08:38] <wikibugs>	 (CR) Jforrester: [C: +2] Merge $wgLogo and $wgLogoHD into $wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: Jforrester)
[00:09:38] <wikibugs>	 (Merged) jenkins-bot: Merge $wgLogo and $wgLogoHD into $wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: Jforrester)
[00:13:15] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T232140: Stop setting wgLogoHD from wgLogos (duration: 01m 05s)
[00:13:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:22] <stashbot>	 T232140: Separate out logo handling into square image logos and long text/wordmark banner logos - https://phabricator.wikimedia.org/T232140
[00:15:05] <wikibugs>	 (PS2) Jforrester: Complete WikiPage/Article split and deprecate Page interface change Article::getTouched to Article::getPage()->getTouched() [mediawiki-config] - https://gerrit.wikimedia.org/r/572751 (https://phabricator.wikimedia.org/T239975) (owner: Art-Baltai)
[00:15:12] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T232140: Merge definition of wgLogos and wgLogo (duration: 01m 04s)
[00:15:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:47] <wikibugs>	 (PS3) Jforrester: extract2: Use Article::getPage()->getTouched(), not Article::getTouched [mediawiki-config] - https://gerrit.wikimedia.org/r/572751 (https://phabricator.wikimedia.org/T239975) (owner: Art-Baltai)
[00:17:01] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 01m 04s)
[00:17:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:26] <wikibugs>	 (PS8) Jforrester: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/572998
[00:17:37] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Papaul) @Volans  Thanks
[00:18:13] <wikibugs>	 (CR) Jforrester: [C: -1] "Waiting for post-wmf.21 tomorrow." [mediawiki-config] - https://gerrit.wikimedia.org/r/572998 (owner: Jforrester)
[00:18:50] <wikibugs>	 (CR) Jforrester: [C: +2] extract2: Use Article::getPage()->getTouched(), not Article::getTouched [mediawiki-config] - https://gerrit.wikimedia.org/r/572751 (https://phabricator.wikimedia.org/T239975) (owner: Art-Baltai)
[00:19:46] <wikibugs>	 (Merged) jenkins-bot: extract2: Use Article::getPage()->getTouched(), not Article::getTouched [mediawiki-config] - https://gerrit.wikimedia.org/r/572751 (https://phabricator.wikimedia.org/T239975) (owner: Art-Baltai)
[00:21:27] <logmsgbot>	 !log jforrester@deploy1001 Synchronized w/extract2.php: T239975: Use Article::getPage()->getTouched(), not Article::getTouched (duration: 01m 04s)
[00:21:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:35] <stashbot>	 T239975: Complete WikiPage/Article split and deprecate Page interface - https://phabricator.wikimedia.org/T239975
[00:24:34] <James_F>	 Prod clear.
[00:24:57] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2009.codfw.wmnet ` The log can be fou...
[00:25:20] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2010.codfw.wmnet ` The log can be fou...
[00:27:55] <wikibugs>	 Operations, ops-eqiad, cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (Jclark-ctr) @wiki_willy I have checked our storage room we have no spares  host is 5 years old at the time  drive needed is  a 300gb 15k sas.  current drive in...
[00:39:52] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[00:39:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:10] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[00:42:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:06] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2009.codfw.wmnet'] `  and were **ALL** successful.
[00:49:43] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[00:49:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:17] <wikibugs>	 Operations, ops-eqiad, cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (wiki_willy) @aborrero (and @Jclark-ctr for visibility) - it looks this was purchased back in 2014, and past the 5yr server life cycle.  Would it be possible to...
[00:51:07] <wikibugs>	 (PS1) CDanis: Revert "Depool esams (hardware troubles)" [dns] - https://gerrit.wikimedia.org/r/575105
[00:52:03] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[00:52:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:01] <wikibugs>	 (CR) Ayounsi: [C: +1] Revert "Depool esams (hardware troubles)" [dns] - https://gerrit.wikimedia.org/r/575105 (owner: CDanis)
[00:55:39] <wikibugs>	 (CR) CDanis: [C: +2] Revert "Depool esams (hardware troubles)" [dns] - https://gerrit.wikimedia.org/r/575105 (owner: CDanis)
[00:55:45] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2011.codfw.wmnet ` The log can be fou...
[00:56:31] <cdanis>	 !log repool esams 🙌 😎
[00:56:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:52] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2010.codfw.wmnet'] `  and were **ALL** successful.
[01:00:04] <jouncebot>	 twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T0100).
[01:01:41] <wikibugs>	 Operations, MediaWiki-General, observability: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (colewhite) Per @fgiunchedi recommendation, I put together a [[ https://github.com/shdubsh/prometheus_client_php/tree/DirectFileStore | very basic mockup of how DirectFileStore might...
[01:06:34] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 57.36 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:09:20] <cdanis>	 ^ expected
[01:09:39] <cdanis>	 codfw was getting most US traffic, and now isn't
[01:10:14] <wikibugs>	 (PS1) BryanDavis: webservice-runner: Fix --extra-args handling [software/tools-webservice] - https://gerrit.wikimedia.org/r/575106
[01:10:45] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[01:10:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:09] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:13:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:57] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2011.codfw.wmnet'] `  and were **ALL** successful.
[01:18:02] <wikibugs>	 (CR) BryanDavis: [C: -1] webservice-runner: Fix --extra-args handling (2 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/575106 (owner: BryanDavis)
[01:18:10] <wikibugs>	 (PS2) BryanDavis: webservice-runner: Fix --extra-args handling [software/tools-webservice] - https://gerrit.wikimedia.org/r/575106
[01:19:44] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2012.codfw.wmnet ` The log can be fou...
[01:22:35] <wikibugs>	 (PS3) BryanDavis: webservice-runner: Fix --extra-args handling [software/tools-webservice] - https://gerrit.wikimedia.org/r/575106
[01:26:56] <wikibugs>	 (CR) BryanDavis: [C: +2] kubernetes: Remove deprecated flag from tcl image [software/tools-webservice] - https://gerrit.wikimedia.org/r/573823 (owner: BryanDavis)
[01:27:08] <wikibugs>	 (CR) BryanDavis: [C: +2] webservice-runner: Fix --extra-args handling [software/tools-webservice] - https://gerrit.wikimedia.org/r/575106 (owner: BryanDavis)
[01:27:34] <wikibugs>	 (Merged) jenkins-bot: kubernetes: Remove deprecated flag from tcl image [software/tools-webservice] - https://gerrit.wikimedia.org/r/573823 (owner: BryanDavis)
[01:27:38] <XioNoX>	 !log re-enable BGP to telia in esams
[01:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:44] <wikibugs>	 (Merged) jenkins-bot: webservice-runner: Fix --extra-args handling [software/tools-webservice] - https://gerrit.wikimedia.org/r/575106 (owner: BryanDavis)
[01:28:10] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2013.codfw.wmnet ` The log can be fou...
[01:30:14] <wikibugs>	 (PS1) Holger Knust: WIP: changeprop/cpjobqueue: Added new config template for cpjobqueue [deployment-charts] - https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399)
[01:30:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP: changeprop/cpjobqueue: Added new config template for cpjobqueue [deployment-charts] - https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: Holger Knust)
[01:34:44] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[01:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:52] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:37:01] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:37:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:42:50] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2012.codfw.wmnet'] `  and were **ALL** successful.
[01:43:09] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[01:43:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:27] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[01:45:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:25] <wikibugs>	 (PS2) Holger Knust: WIP: changeprop/cpjobqueue: Added new config template for cpjobqueue [deployment-charts] - https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399)
[01:50:18] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2013.codfw.wmnet'] `  and were **ALL** successful.
[01:51:11] <wikibugs>	 (PS1) BryanDavis: 3rd try at making extra_args handling "better" [software/tools-webservice] - https://gerrit.wikimedia.org/r/575110 (https://phabricator.wikimedia.org/T244894)
[01:51:44] <wikibugs>	 (CR) Holger Knust: "First draft. Will likely need to test some more tomorrow morning. These are just the changes to create the different config files based on" [deployment-charts] - https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: Holger Knust)
[01:52:28] <wikibugs>	 (CR) BryanDavis: [C: +2] 3rd try at making extra_args handling "better" [software/tools-webservice] - https://gerrit.wikimedia.org/r/575110 (https://phabricator.wikimedia.org/T244894) (owner: BryanDavis)
[01:53:04] <wikibugs>	 (Merged) jenkins-bot: 3rd try at making extra_args handling "better" [software/tools-webservice] - https://gerrit.wikimedia.org/r/575110 (https://phabricator.wikimedia.org/T244894) (owner: BryanDavis)
[01:54:10] <wikibugs>	 (Abandoned) BryanDavis: Partially revert changes to improve support for extra_args [software/tools-webservice] - https://gerrit.wikimedia.org/r/574236 (https://phabricator.wikimedia.org/T244894) (owner: Dapete)
[02:02:20] <wikibugs>	 Operations, Patch-For-Review, User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (CDanis) FWIW I think it would make sense to at least stick a Wikimedia logo there sooner rather than later.
[02:07:14] <wikibugs>	 (PS1) BryanDavis: d/changelog: prepare 0.64 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/575111
[02:07:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] d/changelog: prepare 0.64 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/575111 (owner: BryanDavis)
[02:08:12] <wikibugs>	 (CR) Ppchelko: "Hm... Hmm..Hmm...Hm..." [deployment-charts] - https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: Holger Knust)
[02:08:27] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2014.codfw.wmnet ` The log can be fou...
[02:08:33] <wikibugs>	 (CR) BryanDavis: "recheck" [software/tools-webservice] - https://gerrit.wikimedia.org/r/575111 (owner: BryanDavis)
[02:08:48] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2015.codfw.wmnet ` The log can be fou...
[02:17:16] <wikibugs>	 Operations, ops-codfw, Discovery: elastic2043 has hardware errors that trigger reboots - https://phabricator.wikimedia.org/T243715 (Papaul) Open→Resolved I was able to upgrade the IDRAC as well, the Dell tech wasn't very helpful. I cleared the log and drained the power on the 13th so what was m...
[02:19:26] <wikibugs>	 Operations, ops-codfw, fundraising-tech-ops: (Need by: TBD) codfw:fundraising single-cpu misc servers frpig2001,civi2001.pay-lvs200[1-2] - https://phabricator.wikimedia.org/T244950 (Papaul)
[02:22:44] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[02:22:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:25] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[02:23:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:58] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[02:25:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:26] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[02:27:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:43] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2015.codfw.wmnet'] `  and were **ALL** successful.
[02:32:06] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2017.codfw.wmnet ` The log can be fou...
[02:33:14] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2014.codfw.wmnet'] `  and were **ALL** successful.
[02:35:49] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2016.codfw.wmnet ` The log can be fou...
[02:47:04] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[02:47:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:37] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[02:49:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:50:26] <wikibugs>	 (PS1) CDanis: style: add Wikimedia Foundation logo [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939)
[02:50:46] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[02:50:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:51:17] <wikibugs>	 (CR) CDanis: "Not 100% sure of this, nor how to test, but making an attempt anyway :)" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: CDanis)
[02:53:05] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[02:53:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:54:22] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2017.codfw.wmnet'] `  and were **ALL** successful.
[02:57:51] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2016.codfw.wmnet'] `  and were **ALL** successful.
[03:11:41] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2018.codfw.wmnet ` The log can be fou...
[03:12:19] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2019.codfw.wmnet ` The log can be fou...
[03:26:39] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[03:26:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:27:18] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[03:27:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:28:54] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[03:28:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:31:20] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[03:31:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:33:40] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2018.codfw.wmnet'] `  and were **ALL** successful.
[03:35:35] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2020.codfw.wmnet ` The log can be fou...
[03:37:08] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2019.codfw.wmnet'] `  and were **ALL** successful.
[03:50:34] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[03:50:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:52:48] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[03:52:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:47] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Papaul)
[03:56:17] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Papaul) All parse nodes are ready for service, just missing parse200[7-8]; I think the problem is a wrong mgmt password. I will look into this once a...
[03:57:34] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2020.codfw.wmnet'] `  and were **ALL** successful.
[04:02:14] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:04:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:41:26] <wikibugs>	 Operations, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (Aklapper) @MoritzMuehlenhoff: Could you please answer the last comment? Thanks!
[05:33:13] <wikibugs>	 Operations, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (Dzahn) Open→Stalled
[05:40:31] <icinga-wm>	 PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 113.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[05:54:06] <wikibugs>	 (CR) Gergő Tisza: "Scheduled for SWAT tomorrow." [mediawiki-config] - https://gerrit.wikimedia.org/r/574634 (https://phabricator.wikimedia.org/T240559) (owner: Gergő Tisza)
[06:12:07] <wikibugs>	 (PS1) Marostegui: Revert "db1084: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/575130
[06:12:15] <wikibugs>	 (CR) Marostegui: [C: -2] "Needs to catch up" [puppet] - https://gerrit.wikimedia.org/r/575130 (owner: Marostegui)
[06:12:29] <wikibugs>	 Operations, ops-eqiad, DC-Ops: Replace broken BBU on db1084 (HP host) - https://phabricator.wikimedia.org/T245647 (Marostegui) Thanks John: `    Battery/Capacitor Count: 1    Battery/Capacitor Status: OK `  I have also started MySQL, but it needs catching up. I will take care from here. Thanks again!
[06:40:43] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.079e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[06:55:49] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 8728 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[07:03:59] <wikibugs>	 (CR) Muehlenhoff: Re-enable CAS authentication after enabling CASValidateSAML (1 comment) [puppet] - https://gerrit.wikimedia.org/r/575026 (owner: Muehlenhoff)
[07:20:11] <icinga-wm>	 RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[07:30:18] <wikibugs>	 (PS1) Muehlenhoff: Adapt cross-validate-accounts for system users [puppet] - https://gerrit.wikimedia.org/r/575141 (https://phabricator.wikimedia.org/T235161)
[07:31:17] <jynus>	 I am going to depool and create a dcops ticket for db1098
[07:35:23] <icinga-wm>	 PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 116.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[07:39:44] <wikibugs>	 Operations, Release-Engineering-Team, serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (jijiki) >>! In T245841#5919699, @Joe wrote: >  > What would having all scap proxies also be mcrouter proxies change in terms of the scenario you described above? >   This w...
[07:45:10] <wikibugs>	 Operations, ops-eqiad, DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (jcrespo) Open→Resolved No differences found on s3, s2 tables between source backups and production. Issue fixed.
[07:53:15] <wikibugs>	 (PS1) Muehlenhoff: Unroll Partman configs for Ganeti-based clusters [puppet] - https://gerrit.wikimedia.org/r/575202 (https://phabricator.wikimedia.org/T156955)
[07:58:29] <wikibugs>	 (PS1) Vgutierrez: lvs: Replace lvs2006 with lvs2010 [puppet] - https://gerrit.wikimedia.org/r/575203 (https://phabricator.wikimedia.org/T196560)
[08:06:19] <wikibugs>	 (PS2) Aaron Schulz: [DNM] Use DBO_DEFAULT for extension1 since it is not for key/value blob storage [mediawiki-config] - https://gerrit.wikimedia.org/r/525977
[08:08:09] <wikibugs>	 (Abandoned) Aaron Schulz: Move duplicated RDBMS host lists to ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/524695 (owner: Aaron Schulz)
[08:14:49] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1098 at 50%', diff saved to https://phabricator.wikimedia.org/P10535 and previous config saved to /var/cache/conftool/dbconfig/20200227-081449-jynus.json
[08:14:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:49] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10MoritzMuehlenhoff) While the keyserver networks have some structural issues which are pending some changes and a number of keys...
[08:26:48] <jynus>	 !log killed SpecialFewestRevisions::reallyDoQuery long running query on db1101:s8, causing lag
[08:26:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:13] <icinga-wm>	 RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[08:27:22] <wikibugs>	 (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1003/21107/" [puppet] - 10https://gerrit.wikimedia.org/r/575203 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez)
[08:40:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice! It's missing a few files and since we want to deduplicate this and have a single file for all 3 clusters, I 'd prefer if we don't di" (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan)
[08:47:29] <wikibugs>	 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10MoritzMuehlenhoff) @HMarcus We talked about this in yesterday's Infrastructure Foundations SRE; we would avoid querying the LDAP endpoint of J...
[08:51:23] <wikibugs>	 (03PS2) 10Gehel: airflow: Drop old airflow user/group statement [puppet] - 10https://gerrit.wikimedia.org/r/574538 (owner: 10EBernhardson)
[08:54:23] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] airflow: Drop old airflow user/group statement [puppet] - 10https://gerrit.wikimedia.org/r/574538 (owner: 10EBernhardson)
[09:01:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574888 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond)
[09:03:45] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1098 (s6 & s7)', diff saved to https://phabricator.wikimedia.org/P10536 and previous config saved to /var/cache/conftool/dbconfig/20200227-090344-jynus.json
[09:03:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me! I'll push an updated change to the "de" locale when this is merged (it also needs to be switched from formal to informal" (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond)
[09:05:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: configmaster: Add DNS Discovery discrepancy check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris)
[09:07:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Unroll Partman configs for Ganeti-based clusters [puppet] - 10https://gerrit.wikimedia.org/r/575202 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff)
[09:10:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Unroll Partman configs for Ganeti-based clusters [puppet] - 10https://gerrit.wikimedia.org/r/575202 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff)
[09:12:10] <wikibugs>	 (03PS8) 10Alexandros Kosiaris: configmaster: Add DNS Discovery discrepancy check [puppet] - 10https://gerrit.wikimedia.org/r/573963
[09:16:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "I 'll merge this. It has gotten 3 +1s on principle up to now and I have addressed various implementation comments. Hopefully it will prove" [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris)
[09:17:04] <wikibugs>	 10Operations, 10ops-eqiad: db1098 power redundancy lost - https://phabricator.wikimedia.org/T246323 (10jcrespo)
[09:19:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] style: remove branding [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond)
[09:19:34] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] templates: add initial templates to provide git history [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574888 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond)
[09:21:12] <wikibugs>	 10Operations, 10ops-eqiad: db1098 power redundancy lost - https://phabricator.wikimedia.org/T246323 (10jcrespo) @wiki_willy This could be a power supply failure or other power connectivity issue, there is only so much we can check remotely. We need an onsite check. The server is depooled from production out of...
[09:22:15] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:23:12] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on db1098 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Jcrespo Power redundancy lost. Ticket: https://phabricator.wikimedia.org/T246323 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:23:25] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:26:28] <wikibugs>	 (03PS1) 10Jbond: docker: add docker files to make testing easier [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575208
[09:27:07] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] docker: add docker files to make testing easier [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575208 (owner: 10Jbond)
[09:27:40] <wikibugs>	 (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff)
[09:35:51] <jynus>	 !log upgrade and restart db1084 T246323
[09:35:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:58] <stashbot>	 T246323: db1098 power redundancy lost - https://phabricator.wikimedia.org/T246323
[09:36:50] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 1:" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis)
[09:37:19] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: icinga: Fix disc_desired_state mode [puppet] - 10https://gerrit.wikimedia.org/r/575209
[09:38:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] icinga: Fix disc_desired_state mode [puppet] - 10https://gerrit.wikimedia.org/r/575209 (owner: 10Alexandros Kosiaris)
[09:41:06] <wikibugs>	 (03CR) 10CDanis: "> Patch Set 1:" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis)
[09:49:09] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: discovery: emit Output for the OK case as well [puppet] - 10https://gerrit.wikimedia.org/r/575211
[09:50:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] discovery: emit Output for the OK case as well [puppet] - 10https://gerrit.wikimedia.org/r/575211 (owner: 10Alexandros Kosiaris)
[09:52:18] <wikibugs>	 (03PS12) 10Muehlenhoff: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747
[09:53:50] <wikibugs>	 (03CR) 10Jbond: style: remove branding (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond)
[09:55:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff)
[09:56:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] style: remove branding (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond)
[10:03:56] <wikibugs>	 (03CR) 10Volans: "Alternative proposal inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/575141 (https://phabricator.wikimedia.org/T235161) (owner: 10Muehlenhoff)
[10:04:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix netboot.cfg syntax [puppet] - 10https://gerrit.wikimedia.org/r/575212
[10:05:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Update German login dialogue to refer to Wikimedia Developer Name in other i18ns as well [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575213
[10:06:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Fix netboot.cfg syntax [puppet] - 10https://gerrit.wikimedia.org/r/575212 (owner: 10Muehlenhoff)
[10:06:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix netboot.cfg syntax [puppet] - 10https://gerrit.wikimedia.org/r/575212 (owner: 10Muehlenhoff)
[10:11:26] <wikibugs>	 (03PS1) 10Jbond: templates: add templates base templates used for cas pages [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575214 (https://phabricator.wikimedia.org/T233939)
[10:11:48] <wikibugs>	 10Operations, 10Puppet: Enable strict_hostname_checking on our Puppet nodes - https://phabricator.wikimedia.org/T246327 (10MoritzMuehlenhoff)
[10:13:22] <wikibugs>	 10Operations, 10MediaWiki-General, 10observability: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10Joe) As I repeatedly reiterated, the big issue here is prometheus has a model (pull) that really doesn't work well with the PHP request management model, which is shared-nothing.  M...
[10:14:36] <wikibugs>	 10Operations, 10MediaWiki-General, 10observability, 10serviceops: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10Joe)
[10:17:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "smokeping: temp cr2-esams disable" [puppet] - 10https://gerrit.wikimedia.org/r/574742 (https://phabricator.wikimedia.org/T246009) (owner: 10Filippo Giunchedi)
[10:19:01] <wikibugs>	 (03PS3) 10Matěj Suchánek: Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573969
[10:19:53] <wikibugs>	 (03PS1) 10Jbond: themes: don't use externally hosted js/css files [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575215
[10:21:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, echoing what Keith mentioned re: DNS patch for cas-logstash" [puppet] - 10https://gerrit.wikimedia.org/r/574499 (owner: 10Muehlenhoff)
[10:22:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add ES 7 compatible logstash template [puppet] - 10https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[10:22:12] <wikibugs>	 (03PS2) 10Jbond: themes: don't use externally hosted js/css files [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575215
[10:24:05] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "++ to not use external CSS/JS" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575215 (owner: 10Jbond)
[10:25:19] <wikibugs>	 (03PS3) 10Jbond: themes: don't use externally hosted js/css files [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575215 (https://phabricator.wikimedia.org/T246010)
[10:28:30] <wikibugs>	 (03PS11) 10Muehlenhoff: Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026
[10:31:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: logstash, mediawiki: minor fixes in log streaming (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli)
[10:32:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/574993 (owner: 10Volans)
[10:32:39] <wikibugs>	 (03CR) 10Jbond: "lgtm, however looks like it still needs auth from greg" [puppet] - 10https://gerrit.wikimedia.org/r/575101 (https://phabricator.wikimedia.org/T246053) (owner: 10Dzahn)
[10:38:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Enable CAS endpoint for Kibana [puppet] - 10https://gerrit.wikimedia.org/r/574499 (owner: 10Muehlenhoff)
[10:43:24] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] lvs: Replace lvs2006 with lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/575203 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez)
[10:45:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: use fleetwide uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/575217 (https://phabricator.wikimedia.org/T123918)
[10:47:00] <wikibugs>	 10Operations, 10Puppet: Enable strict_hostname_checking on our Puppet nodes - https://phabricator.wikimedia.org/T246327 (10jbond) I never realised it fell back to the [[  https://puppet.com/docs/puppet/latest/lang_node_definitions.html#matching | fqdn then host + domain facts ]], surprised this hasn't come up...
[10:48:10] <wikibugs>	 10Operations, 10netops: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 (10fgiunchedi) +1 to bumping the limit, although the snippet above has `20` not `200` as the limit for pybal if I'm reading correctly
[10:54:30] <vgutierrez>	 !log replacing lvs2006 with lvs2010 - T196560 T245984
[10:54:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:37] <stashbot>	 T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984
[10:54:37] <stashbot>	 T196560: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560
[10:55:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Re-enable CAS authentication after enabling CASValidateSAML [puppet] - 10https://gerrit.wikimedia.org/r/575026 (owner: 10Muehlenhoff)
[10:57:26] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: enable strict_hostname_checking[1] [puppet] - 10https://gerrit.wikimedia.org/r/575220 (https://phabricator.wikimedia.org/T246327)
[10:58:06] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Caught up, will pool with low load." [puppet] - 10https://gerrit.wikimedia.org/r/575130 (owner: 10Marostegui)
[10:58:16] <wikibugs>	 (03PS2) 10Jcrespo: Revert "db1084: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/575130 (owner: 10Marostegui)
[10:58:44] <vgutierrez>	 !log stop pybal on lvs2003 to let lvs2010 take the traffic for a little bit - T196560 T245984
[10:58:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:55] <vgutierrez>	 !log start pybal on lvs2003 - T196560 T245984
[11:03:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:02] <stashbot>	 T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984
[11:03:03] <stashbot>	 T196560: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560
[11:03:04] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Revert "db1084: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/575130 (owner: 10Marostegui)
[11:03:56] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update German login dialogue to refer to Wikimedia Developer Name in other i18ns as well [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575213 (owner: 10Muehlenhoff)
[11:09:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Thanks!" [labs/private] - 10https://gerrit.wikimedia.org/r/574806 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan)
[11:13:28] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (10Vgutierrez)
[11:14:25] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (10Vgutierrez) a:03Vgutierrez
[11:16:15] <wikibugs>	 (03PS1) 10Raimond Spekking: Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575221 (https://phabricator.wikimedia.org/T246330)
[11:23:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/575220 (https://phabricator.wikimedia.org/T246327) (owner: 10Jbond)
[11:25:16] <wikibugs>	 (03CR) 10Muehlenhoff: swift: use fleetwide uid/gid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575217 (https://phabricator.wikimedia.org/T123918) (owner: 10Filippo Giunchedi)
[11:27:22] <wikibugs>	 (03PS1) 10Vgutierrez: lvs: Decomm lvs2006 [puppet] - 10https://gerrit.wikimedia.org/r/575222 (https://phabricator.wikimedia.org/T246329)
[11:31:08] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: mediawiki::common: use envoy for tls termination too in nodes using it [puppet] - 10https://gerrit.wikimedia.org/r/574988 (https://phabricator.wikimedia.org/T244843)
[11:32:06] <wikibugs>	 (03PS2) 10Vgutierrez: lvs: Decomm lvs2006 [puppet] - 10https://gerrit.wikimedia.org/r/575222 (https://phabricator.wikimedia.org/T246329)
[11:35:21] <addshore>	 !log pause item migration script at Q50 million T219123
[11:35:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:26] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[11:35:59] <wikibugs>	 (03PS3) 10Vgutierrez: lvs: Decomm lvs2006 [puppet] - 10https://gerrit.wikimedia.org/r/575222 (https://phabricator.wikimedia.org/T246329)
[11:39:26] <wikibugs>	 10Operations, 10Patch-For-Review, 10User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (10jbond) example of https://gerrit.wikimedia.org/r/c/operations/software/cas-overlay-template/+/575118 {F31646710}
[11:40:14] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: mediawiki::common: use envoy for tls termination too in nodes using it [puppet] - 10https://gerrit.wikimedia.org/r/574988 (https://phabricator.wikimedia.org/T244843)
[11:40:36] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] "pcc looks sane: https://puppet-compiler.wmflabs.org/compiler1003/21113/" [puppet] - 10https://gerrit.wikimedia.org/r/575222 (https://phabricator.wikimedia.org/T246329) (owner: 10Vgutierrez)
[11:40:57] <wikibugs>	 (03PS2) 10Jbond: style: add Wikimedia Foundation logo [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis)
[11:43:15] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 1:" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis)
[11:45:44] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1084 at 10% T245621', diff saved to https://phabricator.wikimedia.org/P10538 and previous config saved to /var/cache/conftool/dbconfig/20200227-114542-jynus.json
[11:45:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:51] <stashbot>	 T245621: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621
[11:47:08] <logmsgbot>	 !log vgutierrez@cumin2001 START - Cookbook sre.hosts.decommission
[11:47:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:47] <logmsgbot>	 !log vgutierrez@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[11:47:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:53] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin2001 for hosts: `lvs2006.codfw.wmnet` -  lvs2006.codfw.wmnet (**PASS**)   - Downtime...
[11:48:05] <vgutierrez>	 !log run decommission script against lvs2006.codfw.wmnet - T246329
[11:48:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:10] <stashbot>	 T246329: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329
[11:48:36] <vgutierrez>	 volans: ^^ logging the decomm script without the FQDN is actually... futile
[11:48:52] <volans>	 I know I know...
[11:48:58] <volans>	 :'(
[11:55:11] <wikibugs>	 (03PS1) 10Vgutierrez: Remove lvs2006 production entries [dns] - 10https://gerrit.wikimedia.org/r/575223 (https://phabricator.wikimedia.org/T246329)
[11:56:50] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Remove lvs2006 production entries [dns] - 10https://gerrit.wikimedia.org/r/575223 (https://phabricator.wikimedia.org/T246329) (owner: 10Vgutierrez)
[11:57:40] <wikibugs>	 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10Volans) The host has been down a week, hence it has been removed from PuppetDB and the Netbox report caught it. Updated Netbox, setting its state to Failed. Please follow...
[11:58:31] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (10Vgutierrez) a:05Vgutierrez→03Papaul
[11:59:38] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @Papaul lvs2006 is all yours, I've filed T246329
[12:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1200).
[12:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[12:00:51] <wikibugs>	 10Operations, 10Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez)
[12:01:49] * Urbanecm steals SWAT
[12:01:55] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575221 (https://phabricator.wikimedia.org/T246330) (owner: 10Raimond Spekking)
[12:02:56] <wikibugs>	 (03Merged) 10jenkins-bot: Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575221 (https://phabricator.wikimedia.org/T246330) (owner: 10Raimond Spekking)
[12:05:06] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: daee105: Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons (T246330) (duration: 01m 05s)
[12:05:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:12] <stashbot>	 T246330: Add ids.si.edu  to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T246330
[12:05:41] <wikibugs>	 (03PS1) 10Vgutierrez: lvs: Replace lvs2003 with lvs2009 [puppet] - 10https://gerrit.wikimedia.org/r/575224 (https://phabricator.wikimedia.org/T196560)
[12:06:33] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: daee105: Add ids.si.edu to the wgCopyUploadsDomains whitelist of Wikimedia Commons (T246330; take II) (duration: 01m 04s)
[12:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:08] <wikibugs>	 (03PS4) 10Jbond: profile::idp: update profile to use tlsproxy::envoy [puppet] - 10https://gerrit.wikimedia.org/r/574020 (https://phabricator.wikimedia.org/T240941)
[12:07:26] <wikibugs>	 10Operations, 10Service-Architecture: Many objects in conftool have pooled=yes, weight=0 - https://phabricator.wikimedia.org/T245594 (10Joe)
[12:07:48] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: role::mediawiki::common: install envoy as a forward proxy everywhere. [puppet] - 10https://gerrit.wikimedia.org/r/575225 (https://phabricator.wikimedia.org/T244843)
[12:08:07] <wikibugs>	 (03PS5) 10Hnowlan: Admin: Add changeprop namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193)
[12:08:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Admin: Add changeprop namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan)
[12:08:33] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334 (10Vgutierrez)
[12:08:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: 10CDanis)
[12:09:04] <wikibugs>	 (03PS2) 10Vgutierrez: lvs: Replace lvs2003 with lvs2009 [puppet] - 10https://gerrit.wikimedia.org/r/575224 (https://phabricator.wikimedia.org/T196560)
[12:10:55] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1003/21116/" [puppet] - 10https://gerrit.wikimedia.org/r/575224 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez)
[12:11:15] <Urbanecm>	 !log EU SWAT done
[12:11:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:30] <wikibugs>	 (03PS6) 10Hnowlan: Admin: Add changeprop namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193)
[12:13:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] role::mediawiki::common: install envoy as a forward proxy everywhere. [puppet] - 10https://gerrit.wikimedia.org/r/575225 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto)
[12:14:48] <vgutierrez>	 !log replace lvs2003 with lvs2009 - T196560 T245984 T246334
[12:14:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:56] <stashbot>	 T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984
[12:14:56] <stashbot>	 T196560: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560
[12:14:56] <stashbot>	 T246334: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334
[12:15:21] <wikibugs>	 (03CR) 10Hnowlan: Admin: Add changeprop namespace (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan)
[12:17:07] <addshore>	 jouncebot: now
[12:17:07] <jouncebot>	 For the next 0 hour(s) and 42 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1200)
[12:17:14] <addshore>	 Urbanecm: all done? :)
[12:17:22] <Urbanecm>	 addshore: yes
[12:17:30] <addshore>	 If so I'm going to try that good old item term config read patch to 6 million again :D
[12:17:33] <addshore>	 great
[12:18:30] <logmsgbot>	 !log vgutierrez@cumin2001 START - Cookbook sre.hosts.decommission
[12:18:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:40] <logmsgbot>	 !log vgutierrez@cumin2001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
[12:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:57] <vgutierrez>	 FFS, that shouldn't log before I confirm it ;P
[12:19:24] <volans>	 public blame included :D
[12:19:27] <logmsgbot>	 !log vgutierrez@cumin2001 START - Cookbook sre.hosts.decommission
[12:19:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:07] <logmsgbot>	 !log vgutierrez@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[12:20:07] <wikibugs>	 (03PS1) 10Addshore: Read from the new term store again to Q6M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575226
[12:20:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:14] <wikibugs>	 (03PS2) 10Addshore: Read from the new term store again to Q6M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575226
[12:20:15] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin2001 for hosts: `lvs2003.codfw.wmnet` -  lvs2003.codfw.wmnet (**PASS**)   - Downtime...
[12:20:24] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] Read from the new term store again to Q6M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575226 (owner: 10Addshore)
[12:21:22] <wikibugs>	 (03Merged) 10jenkins-bot: Read from the new term store again to Q6M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575226 (owner: 10Addshore)
[12:24:06] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q6M (was Q2M) again (T219123) (duration: 01m 45s)
[12:24:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:11] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[12:24:28] <addshore>	 12:24:05 1 hosts had failures restarting php-fpm
[12:24:34] <addshore>	 Urbanecm: ^^ did you also get this?
[12:25:02] <volans>	 addshore: which one?
[12:25:06] <Urbanecm>	 I don't think so
[12:25:07] <addshore>	 volans: https://phabricator.wikimedia.org/P10539
[12:25:12] <addshore>	 debug :)
[12:25:49] <volans>	 effie: might be related to anything ongoing on mwdebug2001?
[12:26:21] * addshore is resyncing now anyway to make sure it deployed, will see if it pops up again
[12:26:48] <wikibugs>	 (03PS1) 10Vgutierrez: lvs: Decomm lvs2003 [puppet] - 10https://gerrit.wikimedia.org/r/575227 (https://phabricator.wikimedia.org/T246334)
[12:26:50] <effie>	 volans: mm no, it should be working ok, I can take a look on mwdebug1001 
[12:27:19] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q6M (was Q2M) again (T219123) cachebust? (duration: 01m 17s)
[12:27:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:32] <addshore>	 ^^ my second sync there had no errors or warnings
[12:27:33] <effie>	 oh it is mwdebug2001
[12:30:24] <wikibugs>	 (CR) CDanis: [C: +1] style: add Wikimedia Foundation logo [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: CDanis)
[12:30:26] <wikibugs>	 (CR) Vgutierrez: [C: +2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1003/21117/" [puppet] - https://gerrit.wikimedia.org/r/575227 (https://phabricator.wikimedia.org/T246334) (owner: Vgutierrez)
[12:31:01] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] role::mediawiki::common: install envoy as a forward proxy everywhere. [puppet] - https://gerrit.wikimedia.org/r/575225 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[12:33:25] <wikibugs>	 (PS1) Addshore: Read from the new term store up to Q8 million [mediawiki-config] - https://gerrit.wikimedia.org/r/575231 (https://phabricator.wikimedia.org/T219123)
[12:33:40] <wikibugs>	 (PS1) Vgutierrez: Remove lvs2003 production entries [dns] - https://gerrit.wikimedia.org/r/575232 (https://phabricator.wikimedia.org/T246334)
[12:33:57] <wikibugs>	 (CR) Addshore: [C: +2] Read from the new term store up to Q8 million [mediawiki-config] - https://gerrit.wikimedia.org/r/575231 (https://phabricator.wikimedia.org/T219123) (owner: Addshore)
[12:35:01] <wikibugs>	 (Merged) jenkins-bot: Read from the new term store up to Q8 million [mediawiki-config] - https://gerrit.wikimedia.org/r/575231 (https://phabricator.wikimedia.org/T219123) (owner: Addshore)
[12:35:03] <wikibugs>	 (CR) Vgutierrez: [C: +2] Remove lvs2003 production entries [dns] - https://gerrit.wikimedia.org/r/575232 (https://phabricator.wikimedia.org/T246334) (owner: Vgutierrez)
[12:35:48] <wikibugs>	 (CR) Elukey: [C: +2] Add an-launcher1001 to profile::dumps::distribution [puppet] - https://gerrit.wikimedia.org/r/575048 (https://phabricator.wikimedia.org/T243934) (owner: Elukey)
[12:36:31] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q8M (was Q6M) again (T219123) (duration: 01m 04s)
[12:36:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:37] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[12:37:45] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q8M (was Q6M) again (T219123) ?cachebust (duration: 01m 03s)
[12:38:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:06] <wikibugs>	 Operations, netops: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 (ayounsi) The syntax is not obvious, `maximum 1000 teardown 20` means shut down the session at 1000 but start sending warning logs at 20% of the 1000.
[12:41:11] <XioNoX>	 !log bump BGP prefix-limit on all routers - T246110
[12:41:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:16] <stashbot>	 T246110: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110
[12:43:50] <icinga-wm>	 PROBLEM - Check systemd state on mw2299 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:45:22] <wikibugs>	 Operations, ops-codfw, DC-Ops, Traffic, and 2 others: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334 (Vgutierrez) a:Vgutierrez→Papaul
[12:49:03] <wikibugs>	 Operations, ops-codfw, Traffic, Patch-For-Review: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Vgutierrez) @Papaul same for lvs2003: T246334  Regarding lvs2007 and lvs2008, please update the NICs FW to the same versions as you did for lvs2009 and lvs...
[12:51:09] <wikibugs>	 Operations, Traffic, Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (Vgutierrez)
[12:51:54] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] style: add Wikimedia Foundation logo [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/575118 (https://phabricator.wikimedia.org/T233939) (owner: CDanis)
[12:52:15] <wikibugs>	 Operations, netops: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 (ayounsi) Open→Resolved Done.
[12:56:26] <XioNoX>	 !log delete specific tcp-mss on cr2-eqiad:equinix (will cause an interface flap) - T244610
[12:56:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:09] <XioNoX>	 actually no interface flap as there is another "global" one still in effect
[12:57:10] <wikibugs>	 (PS1) Addshore: Read from the new term store, back to Q2 million [mediawiki-config] - https://gerrit.wikimedia.org/r/575235 (https://phabricator.wikimedia.org/T219123)
[12:59:19] <icinga-wm>	 PROBLEM - Check systemd state on mw2258 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:59:19] <icinga-wm>	 PROBLEM - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:01] <icinga-wm>	 PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:35] <icinga-wm>	 PROBLEM - Check systemd state on mw2147 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:53] <icinga-wm>	 PROBLEM - Check systemd state on mw2319 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:53] <icinga-wm>	 PROBLEM - Check systemd state on mw2296 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:56] <cdanis>	 uhm
[13:01:11] <effie>	 is that expected? I have not read any backlog
[13:01:17] <icinga-wm>	 PROBLEM - Check systemd state on mw2183 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:23] <icinga-wm>	 PROBLEM - Check systemd state on mw2195 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:25] <moritzm>	 looking
[13:01:27] <icinga-wm>	 PROBLEM - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:33] <cdanis>	 _joe_: ● envoyproxy.service loaded failed failed Envoy proxy
[13:01:34] <effie>	 looking as well
[13:01:36] <elukey>	 seems envoy
[13:01:39] <elukey>	 yeah
[13:01:58] <_joe_>	 yep
[13:02:01] <_joe_>	 no idea why
[13:02:03] <moritzm>	 permission denied on echorestore.log?
[13:02:05] <icinga-wm>	 PROBLEM - Check systemd state on mw2152 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:02:07] <_joe_>	 it worked on the first two hosts
[13:02:07] <effie>	 oh it is the envoy hippy 
[13:02:09] <icinga-wm>	 PROBLEM - Check systemd state on mw2182 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:02:10] <effie>	 lol 
[13:02:10] <_joe_>	 oh sigh yes
[13:02:13] <addshore>	 am I okay to revert my config change (seems unrelated to those problems) ? :)
[13:02:17] <cdanis>	 _joe_: shall I stop puppet on appservers?
[13:02:21] <icinga-wm>	 PROBLEM - Check systemd state on mw2288 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:02:27] <_joe_>	 cdanis: no, it's just spam
[13:02:33] <_joe_>	 don't worry, it will be fixed now
[13:02:38] <cdanis>	 ok
[13:02:41] <wikibugs>	 (CR) Addshore: [C: +2] Read from the new term store, back to Q2 million [mediawiki-config] - https://gerrit.wikimedia.org/r/575235 (https://phabricator.wikimedia.org/T219123) (owner: Addshore)
[13:02:51] * addshore takes that as a yes
[13:02:52] <cdanis>	 addshore: +1
[13:02:56] <addshore>	 :) ty
[13:03:01] <icinga-wm>	 PROBLEM - Check systemd state on mw2223 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:01] <icinga-wm>	 PROBLEM - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:09] <icinga-wm>	 PROBLEM - Check systemd state on mw2280 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:09] <icinga-wm>	 PROBLEM - Check systemd state on mw2202 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:16] <_joe_>	 addshore: yes, go on if you need to revert
[13:03:25] <icinga-wm>	 PROBLEM - Check systemd state on mw2310 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:26] <_joe_>	 this is just noise
[13:03:34] <addshore>	 just noise, white noise
[13:03:37] <icinga-wm>	 PROBLEM - Check systemd state on mw2137 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:45] <icinga-wm>	 PROBLEM - Check systemd state on mw2265 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:46] <moritzm>	 yeah, second puppet run fixes it
[13:03:49] <_joe_>	 !log re-stopped puppet on codfw
[13:03:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:56] <_joe_>	 moritzm: it shouldn't
[13:03:57] <icinga-wm>	 PROBLEM - Check systemd state on mw2287 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:04:01] <icinga-wm>	 PROBLEM - Check systemd state on mw2255 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:04:07] <icinga-wm>	 PROBLEM - Check systemd state on mw2277 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:04:08] <wikibugs>	 (Merged) jenkins-bot: Read from the new term store, back to Q2 million [mediawiki-config] - https://gerrit.wikimedia.org/r/575235 (https://phabricator.wikimedia.org/T219123) (owner: Addshore)
[13:04:26] <_joe_>	 anyways, fixing it
[13:04:27] <moritzm>	 yeah, you're right
[13:04:29] <icinga-wm>	 PROBLEM - Check systemd state on mw2141 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:04:42] <_joe_>	 I have no idea how or why this is happening
[13:04:43] <icinga-wm>	 PROBLEM - Check systemd state on mw2240 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:26] <wikibugs>	 (PS1) Jbond: ldap properties: add ldap config file to ease local testing [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/575236
[13:05:29] <icinga-wm>	 PROBLEM - Check systemd state on mw2252 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:35] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q2M (was Q8M) again (T219123) (duration: 01m 03s)
[13:05:39] <icinga-wm>	 PROBLEM - Check systemd state on mw2256 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:41] <icinga-wm>	 PROBLEM - Check systemd state on mw2274 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:47] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[13:06:05] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] templates: add templates base templates used for cas pages [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/575214 (https://phabricator.wikimedia.org/T233939) (owner: Jbond)
[13:06:11] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] themes: don't use externally hosted js/css files [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/575215 (https://phabricator.wikimedia.org/T246010) (owner: Jbond)
[13:06:13] <icinga-wm>	 PROBLEM - Check systemd state on mw2201 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:15] <icinga-wm>	 PROBLEM - Check systemd state on mw2225 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:21] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] ldap properties: add ldap config file to ease local testing [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/575236 (owner: Jbond)
[13:06:31] <icinga-wm>	 PROBLEM - Check systemd state on mw2315 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:35] <icinga-wm>	 PROBLEM - Check systemd state on mw2253 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:35] <icinga-wm>	 PROBLEM - Check systemd state on mw2231 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:48] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q2M (was Q8M) again (T219123) ?cachebust (duration: 01m 03s)
[13:06:49] <icinga-wm>	 RECOVERY - Check systemd state on mw2310 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:52] <addshore>	 that's me done
[13:07:01] <icinga-wm>	 RECOVERY - Check systemd state on mw2137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:04] <_joe_>	 !log restarting envoy, after chowning the log files, on all codfw mw servers where it was installed
[13:07:07] <icinga-wm>	 RECOVERY - Check systemd state on mw2152 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:07] <icinga-wm>	 RECOVERY - Check systemd state on mw2147 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:11] <icinga-wm>	 RECOVERY - Check systemd state on mw2265 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:11] <icinga-wm>	 RECOVERY - Check systemd state on mw2252 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:13] <icinga-wm>	 RECOVERY - Check systemd state on mw2182 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:21] <icinga-wm>	 RECOVERY - Check systemd state on mw2256 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:21] <icinga-wm>	 RECOVERY - Check systemd state on mw2258 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:23] <icinga-wm>	 RECOVERY - Check systemd state on mw2154 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:23] <icinga-wm>	 RECOVERY - Check systemd state on mw2274 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:27] <icinga-wm>	 RECOVERY - Check systemd state on mw2287 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:27] <icinga-wm>	 RECOVERY - Check systemd state on mw2288 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:29] <icinga-wm>	 RECOVERY - Check systemd state on mw2255 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:31] <icinga-wm>	 RECOVERY - Check systemd state on mw2319 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:31] <icinga-wm>	 RECOVERY - Check systemd state on mw2296 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:37] <icinga-wm>	 RECOVERY - Check systemd state on mw2277 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:01] <icinga-wm>	 RECOVERY - Check systemd state on mw2141 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:01] <icinga-wm>	 RECOVERY - Check systemd state on mw2183 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:01] <icinga-wm>	 RECOVERY - Check systemd state on mw2201 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:03] <icinga-wm>	 RECOVERY - Check systemd state on mw2225 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:07] <icinga-wm>	 RECOVERY - Check systemd state on mw2195 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:11] <icinga-wm>	 RECOVERY - Check systemd state on mw2223 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:11] <icinga-wm>	 RECOVERY - Check systemd state on mw2157 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:13] <icinga-wm>	 RECOVERY - Check systemd state on mw2248 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:15] <icinga-wm>	 RECOVERY - Check systemd state on mw2240 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:17] <icinga-wm>	 RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:19] <icinga-wm>	 RECOVERY - Check systemd state on mw2315 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:19] <icinga-wm>	 RECOVERY - Check systemd state on mw2280 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:19] <icinga-wm>	 RECOVERY - Check systemd state on mw2202 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:23] <icinga-wm>	 RECOVERY - Check systemd state on mw2253 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:23] <icinga-wm>	 RECOVERY - Check systemd state on mw2231 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:08:39] <wikibugs>	 (PS1) Ayounsi: esams/knams: remove prepending and tcp-mss clamping [homer/public] - https://gerrit.wikimedia.org/r/575237
[13:09:34] <wikibugs>	 (CR) CDanis: [C: +1] esams/knams: remove prepending and tcp-mss clamping [homer/public] - https://gerrit.wikimedia.org/r/575237 (owner: Ayounsi)
[13:09:46] <wikibugs>	 (CR) Ayounsi: [C: +2] esams/knams: remove prepending and tcp-mss clamping [homer/public] - https://gerrit.wikimedia.org/r/575237 (owner: Ayounsi)
[13:10:03] <wikibugs>	 (Merged) jenkins-bot: esams/knams: remove prepending and tcp-mss clamping [homer/public] - https://gerrit.wikimedia.org/r/575237 (owner: Ayounsi)
[13:11:55] <XioNoX>	 !log esams/knams rollback tcp-mss camping and prepending
[13:11:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:11] <cdanis>	 !log s/camping/clamping/
[13:13:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:49] <icinga-wm>	 RECOVERY - Check systemd state on mw2299 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:24:37] <wikibugs>	 (PS2) Filippo Giunchedi: swift: use fleetwide uid/gid [puppet] - https://gerrit.wikimedia.org/r/575217 (https://phabricator.wikimedia.org/T123918)
[13:24:44] <wikibugs>	 (CR) Filippo Giunchedi: swift: use fleetwide uid/gid (1 comment) [puppet] - https://gerrit.wikimedia.org/r/575217 (https://phabricator.wikimedia.org/T123918) (owner: Filippo Giunchedi)
[13:28:37] <_joe_>	 !log installing envoy in eqiad too
[13:28:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:39] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/575217 (https://phabricator.wikimedia.org/T123918) (owner: Filippo Giunchedi)
[13:34:50] <wikibugs>	 (CR) Muehlenhoff: "@Keith, Filippo: Yes, this isn't complete yet, there will be additional followup commits for Varnish and DNS as well." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/574499 (owner: Muehlenhoff)
[13:35:09] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:35:30] <wikibugs>	 Operations, netops: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (ayounsi) p:Triage→Medium
[13:36:31] <_joe_>	 uh
[13:36:43] <_joe_>	 godog: ^^ can that be related to my changes?
[13:36:45] <wikibugs>	 (PS4) Muehlenhoff: Enable CAS endpoint for Kibana [puppet] - https://gerrit.wikimedia.org/r/574499
[13:37:21] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:38:03] <_joe_>	 ahem
[13:38:05] <godog>	 _joe_: mhh I'm not sure, that means icinga_exporter couldn't be queried in time
[13:38:18] <_joe_>	 oh ok icinga_exporter
[13:38:53] <godog>	 thinking out loud, icinga restarts shouldn't affect it either
[13:48:46] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] swift: use fleetwide uid/gid [puppet] - https://gerrit.wikimedia.org/r/575217 (https://phabricator.wikimedia.org/T123918) (owner: Filippo Giunchedi)
[13:52:36] <wikibugs>	 Operations, SRE-swift-storage, Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (fgiunchedi)
[13:58:49] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "Couple of typos, otherwise LGTM." (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: Hnowlan)
[13:59:38] <wikibugs>	 (PS1) Elukey: cdh::hive: improve jar file match regex to work with BigTop [puppet] - https://gerrit.wikimedia.org/r/575242 (https://phabricator.wikimedia.org/T244499)
[14:02:21] <wikibugs>	 (CR) Elukey: [C: +2] cdh::hive: improve jar file match regex to work with BigTop [puppet] - https://gerrit.wikimedia.org/r/575242 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[14:03:26] <Urbanecm>	 jouncebot: next
[14:03:26] <jouncebot>	 In 2 hour(s) and 56 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1700)
[14:03:30] <Urbanecm>	 jouncebot: now
[14:03:30] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 56 minute(s)
[14:04:56] <wikibugs>	 (CR) Gilles: "The Thumbor configuration for tests is different than the configuration of Thumbor as installed by the Debian packages." [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) (owner: Brion VIBBER)
[14:05:24] <wikibugs>	 (PS1) Urbanecm: Increase arwiki's WikiGap throttle lift to 400 accounts [mediawiki-config] - https://gerrit.wikimedia.org/r/575243 (https://phabricator.wikimedia.org/T246092)
[14:05:34] <wikibugs>	 (CR) Gilles: "*if you made a mistake" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) (owner: Brion VIBBER)
[14:05:36] <wikibugs>	 (CR) Urbanecm: [C: +2] Increase arwiki's WikiGap throttle lift to 400 accounts [mediawiki-config] - https://gerrit.wikimedia.org/r/575243 (https://phabricator.wikimedia.org/T246092) (owner: Urbanecm)
[14:06:34] <wikibugs>	 (Merged) jenkins-bot: Increase arwiki's WikiGap throttle lift to 400 accounts [mediawiki-config] - https://gerrit.wikimedia.org/r/575243 (https://phabricator.wikimedia.org/T246092) (owner: Urbanecm)
[14:07:37] <wikibugs>	 (PS1) Giuseppe Lavagetto: role::parsoid: base it on role::mediawiki::common [puppet] - https://gerrit.wikimedia.org/r/575244
[14:08:25] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: 7e3a57a: Increase arwiki WikiGap throttle lift to 400 accounts (T246092) (duration: 01m 05s)
[14:09:31] <Urbanecm>	 where are you, stashbot ?
[14:09:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:18] <stashbot>	 T246092: Temporary lift IP cap for WikiGap edit-a-thon at Khawarizmi College in 5 March 2020 - https://phabricator.wikimedia.org/T246092
[14:11:12] <wikibugs>	 (PS2) Giuseppe Lavagetto: role::parsoid: base it on role::mediawiki::common [puppet] - https://gerrit.wikimedia.org/r/575244
[14:17:14] <wikibugs>	 Operations, ops-codfw: (Need by: TBD) codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (Gehel)
[14:20:23] <wikibugs>	 Operations, ops-eqiad, cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (bd808) >>! In T244986#5922092, @wiki_willy wrote: > @aborrero (and @Jclark-ctr for visibility) - it looks this was purchased back in 2014, and past the 5yr ser...
[14:20:26] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/21120/" [puppet] - https://gerrit.wikimedia.org/r/575244 (owner: Giuseppe Lavagetto)
[14:25:35] <wikibugs>	 (PS1) Vgutierrez: install_server: Reimage lvs4007 with buster [puppet] - https://gerrit.wikimedia.org/r/575246 (https://phabricator.wikimedia.org/T245984)
[14:26:17] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable CAS endpoint for Kibana [puppet] - https://gerrit.wikimedia.org/r/574499 (owner: Muehlenhoff)
[14:26:37] <wikibugs>	 (CR) BryanDavis: [C: +2] d/changelog: prepare 0.64 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/575111 (owner: BryanDavis)
[14:27:06] <wikibugs>	 (CR) Vgutierrez: [C: +2] install_server: Reimage lvs4007 with buster [puppet] - https://gerrit.wikimedia.org/r/575246 (https://phabricator.wikimedia.org/T245984) (owner: Vgutierrez)
[14:29:46] <wikibugs>	 (Merged) jenkins-bot: d/changelog: prepare 0.64 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/575111 (owner: BryanDavis)
[14:33:30] <wikibugs>	 (CR) Ottomata: "I (obviously) think my approach using main_app.name as the 'main app' identifier is a good one.  I think Alex does too.  There might be so" [deployment-charts] - https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: Holger Knust)
[14:33:52] <wikibugs>	 (PS7) Hnowlan: Admin: Add changeprop namespace [deployment-charts] - https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193)
[14:35:20] <wikibugs>	 Operations, Traffic, Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs4007.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[14:35:50] <vgutierrez>	 !log reimage lvs4007 with buster - T245984
[14:35:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:56] <stashbot>	 T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984
[14:37:37] <wikibugs>	 (CR) Ottomata: "BTW, the stuff I did for eventgate chart does change some of the conventions we've been using already.  I think the phab ticket Petr linke" [deployment-charts] - https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: Holger Knust)
[14:38:22] <wikibugs>	 (CR) Hnowlan: Admin: Add changeprop namespace (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: Hnowlan)
[14:49:57] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[14:50:57] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:53:29] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[14:53:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:52] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:55:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:03] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1084 at 50% T245621', diff saved to https://phabricator.wikimedia.org/P10542 and previous config saved to /var/cache/conftool/dbconfig/20200227-150302-jynus.json
[15:03:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:10] <stashbot>	 T245621: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621
[15:03:37] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs4007.ulsfo.wmnet'] `  and were **ALL** successful.
[15:06:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "I tend to agree with Petr, this approach feels wrong. Having a flag to switch from cpjobqueue to changeprop essentially says that we can't" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust)
[15:07:28] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] toolforge: upgrade elasticsearch and add debian buster support [puppet] - 10https://gerrit.wikimedia.org/r/574527 (https://phabricator.wikimedia.org/T236606) (owner: 10Jhedden)
[15:10:33] <wikibugs>	 (03PS1) 10Vgutierrez: lvs: Reimage lvs4006 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575256 (https://phabricator.wikimedia.org/T245984)
[15:11:10] <wikibugs>	 (03PS2) 10Vgutierrez: install_server,lvs: Reimage lvs4006 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575256 (https://phabricator.wikimedia.org/T245984)
[15:12:48] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] install_server,lvs: Reimage lvs4006 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575256 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez)
[15:13:00] <wikibugs>	 (03PS3) 10Vgutierrez: install_server,lvs: Reimage lvs4006 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575256 (https://phabricator.wikimedia.org/T245984)
[15:15:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Add discovery for eventgate-analytics-external [puppet] - 10https://gerrit.wikimedia.org/r/573366 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata)
[15:15:32] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Route intake-analytics.wm.org to eventgate-analytics-external [puppet] - 10https://gerrit.wikimedia.org/r/573369 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata)
[15:16:33] <wikibugs>	 (03PS1) 10Elukey: role::search::airflow: allow analytics-admins to ssh to hosts [puppet] - 10https://gerrit.wikimedia.org/r/575260
[15:16:37] <wikibugs>	 10Operations, 10DBA: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10jcrespo) I will let @Marostegui put it back to 100% and do the full revert and finishing touches + resolv.
[15:17:49] <vgutierrez>	 !log reimage lvs4006 with buster - T245984
[15:17:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:56] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs4006.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[15:17:56] <stashbot>	 T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984
[15:18:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::search::airflow: allow analytics-admins to ssh to hosts [puppet] - 10https://gerrit.wikimedia.org/r/575260 (owner: 10Elukey)
[15:19:46] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "lgtm -- I'd like Krenair to confirm that there aren't any <4 puppetmasters still living in cloud-vps" [puppet] - 10https://gerrit.wikimedia.org/r/575220 (https://phabricator.wikimedia.org/T246327) (owner: 10Jbond)
[15:22:50] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Question: nothing really contacts change-prop via HTTP, except maybe service-checker that does a simple health check. Do we even want to" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574811 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan)
[15:23:52] <moritzm>	 !log installing curl security updates on stretch/buster
[15:23:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:20] <wikibugs>	 (03CR) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli)
[15:24:22] <wikibugs>	 (03PS3) 10Hnowlan: changeprop: add hierdata k8s entries [puppet] - 10https://gerrit.wikimedia.org/r/574811 (https://phabricator.wikimedia.org/T213193)
[15:24:43] <wikibugs>	 (03CR) 10Hnowlan: changeprop: add hierdata k8s entries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574811 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan)
[15:26:09] <wikibugs>	 (03CR) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli)
[15:28:54] <wikibugs>	 (03PS7) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472)
[15:29:10] <moritzm>	 !log restarting mw canaries to pick up curl update
[15:29:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:31:18] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.35.0-wmf.21/extensions/ConfirmEdit/includes/auth/CaptchaPreAuthenticationProvider.php: T245280 (duration: 01m 05s)
[15:31:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:24] <stashbot>	 T245280: logstash_formatter_key_conflict in mediawiki logs - https://phabricator.wikimedia.org/T245280
[15:32:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:32:49] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.35.0-wmf.20/extensions/ConfirmEdit/includes/auth/CaptchaPreAuthenticationProvider.php: T245280 (duration: 01m 04s)
[15:32:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "A few comments, but overall this looks pretty close to being ready. Nice!" (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway)
[15:33:56] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:35:02] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[15:35:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:24] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[15:37:14] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:37:22] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:37:41] <wikibugs>	 (03PS1) 10Addshore: Read from the new term store up to Q4 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575264 (https://phabricator.wikimedia.org/T219123)
[15:37:46] <addshore>	 jouncebot: now
[15:37:47] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 22 minute(s)
[15:38:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:12] <wikibugs>	 (03PS1) 10Elukey: profile::analytics::search::airflow: fix group require [puppet] - 10https://gerrit.wikimedia.org/r/575265
[15:38:22] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[15:39:56] <wikibugs>	 (03CR) 10Elukey: "Puppet has been broken for a long time due to this bug, let's check it when doing changes :)" [puppet] - 10https://gerrit.wikimedia.org/r/575265 (owner: 10Elukey)
[15:40:43] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::analytics::search::airflow: fix group require [puppet] - 10https://gerrit.wikimedia.org/r/575265 (owner: 10Elukey)
[15:41:03] <wikibugs>	 (03PS1) 10Jhedden: toolforge: add prometheus exporter for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/575266
[15:44:12] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] toolforge: add prometheus exporter for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/575266 (owner: 10Jhedden)
[15:44:42] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs4006.ulsfo.wmnet'] `  and were **ALL** successful.
[15:45:27] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Due by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10RobH)
[15:45:48] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Due by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10RobH)
[15:46:02] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Due by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10RobH)
[15:46:56] <wikibugs>	 (03PS2) 10Effie Mouzeli: thumbor: remove nginx code leftovers [puppet] - 10https://gerrit.wikimedia.org/r/572033
[15:50:24] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: ProductionServices: switch search to use envoy instead of nginx [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575268 (https://phabricator.wikimedia.org/T244843)
[15:50:28] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: ProductionServices: use local http proxy for parsoid, parsoidphp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575269 (https://phabricator.wikimedia.org/T244843)
[15:50:30] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: ProductionServices: use the local proxy for sessionstore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575270 (https://phabricator.wikimedia.org/T244843)
[15:50:37] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Looks ok, except the fact we're still not specifying where to connect to Redis. the secrets will get us the redis path, but the redis ur" (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust)
[15:52:35] <moritzm>	 !log installing python-pysaml security updates
[15:52:36] <wikibugs>	 (03PS8) 10Effie Mouzeli: logstash, mediawiki: minor fixes in log streaming [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472)
[15:52:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:10] <wikibugs>	 (03PS1) 10Andrew Bogott: keystone hooks: create .wmcloud.org project domain during project creation [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174)
[15:54:12] <wikibugs>	 (03CR) 10Effie Mouzeli: "PCC for mwdebug https://puppet-compiler.wmflabs.org/compiler1001/21122/mwdebug1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli)
[15:54:32] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] ProductionServices: switch search to use envoy instead of nginx [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575268 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto)
[15:55:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] keystone hooks: create .wmcloud.org project domain during project creation [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174) (owner: 10Andrew Bogott)
[15:56:02] <moritzm>	 !log installing python-django updates (packaged Debian version)
[15:56:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:54] <volans>	 it seems a calmer moment, I'll merge the icinga patch that should be noop
[15:57:02] <wikibugs>	 (03CR) 10Volans: [C: 03+2] icinga: fix use of stale unpuppetized check files [puppet] - 10https://gerrit.wikimedia.org/r/574993 (owner: 10Volans)
[15:58:39] <wikibugs>	 (03PS2) 10Andrew Bogott: keystone hooks: create .wmcloud.org project domain during project creation [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174)
[15:59:00] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] Read from the new term store up to Q4 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575264 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore)
[15:59:12] <addshore>	 take 50
[15:59:16] * addshore lost count
[15:59:45] <moritzm>	 !log installing e2fsck security updates on buster
[15:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:07] <wikibugs>	 (03Merged) 10jenkins-bot: Read from the new term store up to Q4 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575264 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore)
[16:00:10] <wikibugs>	 (03PS3) 10Andrew Bogott: keystone hooks: create .wmcloud.org project domain during project creation [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174)
[16:02:30] <effie>	 !log disable puppet on thumbor*
[16:02:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:08] <wikibugs>	 10Operations: Integrate Buster 10.3 point update - https://phabricator.wikimedia.org/T244693 (10MoritzMuehlenhoff)
[16:03:46] <wikibugs>	 (03PS1) 10Vgutierrez: lvs: Re-enable BGP in lvs4006 [puppet] - 10https://gerrit.wikimedia.org/r/575274 (https://phabricator.wikimedia.org/T245984)
[16:05:24] <moritzm>	 !log installing python3.7 security updates on Buster
[16:05:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:50] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q4M for the new term store for clients (was Q2M) + warm db1126 caches (T219123) (duration: 01m 04s)
[16:05:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:57] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[16:07:26] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q4M for the new term store for clients (was Q2M) + warm db1126 caches (T219123) cache bust (duration: 01m 04s)
[16:07:27] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2007.codfw.wmnet ` The log can be fou...
[16:07:33] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] lvs: Re-enable BGP in lvs4006 [puppet] - 10https://gerrit.wikimedia.org/r/575274 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez)
[16:08:00] <addshore>	 !log begin warming wikidata term cache on db1126 for Q4-6 million T219123
[16:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:57] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` parse2008.codfw.wmnet ` The log can be fou...
[16:09:10] <vgutierrez>	 !log re-enable BGP in lvs4006 - T245984
[16:09:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:35] <stashbot>	 T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984
[16:10:45] <Urbanecm>	 !log mwscript extensions/AbuseFilter/maintenance/fixOldLogEntries.php --wiki=mediawikiwiki --verbose (T228655)
[16:10:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:51] <stashbot>	 T228655: Dry-run fixOldLogEntries for AbuseFilter - https://phabricator.wikimedia.org/T228655
[16:11:47] <wikibugs>	 (03PS4) 10Andrew Bogott: keystone hooks: create .wmcloud.org project domain during project creation [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174)
[16:11:53] <Urbanecm>	 !log foreachwiki extensions/AbuseFilter/maintenance/fixOldLogEntries.php --verbose started (T228655)
[16:11:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:03] <wikibugs>	 (03PS1) 10Vgutierrez: install_server,lvs: Reimage lvs4005 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575277 (https://phabricator.wikimedia.org/T245984)
[16:12:52] <papaul>	 !log rebooting parse2009 to clear memory error 
[16:12:53] <wikibugs>	 (03CR) 10Holger Knust: "Redis is defined in the defaults and to keep it consistent with the other KVs, I overrode only the non-default items. Still add to the ind" [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust)
[16:12:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:30] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] install_server,lvs: Reimage lvs4005 with buster [puppet] - 10https://gerrit.wikimedia.org/r/575277 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez)
[16:14:49] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: remove nginx code leftovers [puppet] - 10https://gerrit.wikimedia.org/r/572033 (owner: 10Effie Mouzeli)
[16:15:19] <vgutierrez>	 effie: may I merge that?
[16:15:57] <icinga-wm>	 PROBLEM - Host parse2009 is DOWN: PING CRITICAL - Packet loss = 100%
[16:16:17] <vgutierrez>	 effie: :? :)
[16:16:45] <icinga-wm>	 RECOVERY - Host parse2009 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
[16:18:45] <wikibugs>	 (03PS5) 10Andrew Bogott: keystone hooks: create new default domains for new projects [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174)
[16:20:36] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs4005.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[16:20:38] <vgutierrez>	 !log reimage lvs4005 with buster - T245984
[16:20:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:44] <stashbot>	 T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984
[16:21:45] <moritzm>	 !log installing wget security updates on jessie
[16:21:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:03] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] site: add new parsoid nodes with spare role [puppet] - 10https://gerrit.wikimedia.org/r/575100 (https://phabricator.wikimedia.org/T243112) (owner: 10Dzahn)
[16:22:18] <wikibugs>	 (03PS4) 10Dzahn: site: add new parsoid nodes with spare role [puppet] - 10https://gerrit.wikimedia.org/r/575100 (https://phabricator.wikimedia.org/T243112)
[16:22:24] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[16:22:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:08] <wikibugs>	 10Operations: Integrate Buster 10.3 point update - https://phabricator.wikimedia.org/T244693 (10MoritzMuehlenhoff)
[16:23:53] <jynus>	 no backups on apt1001
[16:23:55] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime
[16:23:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:05] <jynus>	 moritzm: ^yours, new host?
[16:24:10] <mutante>	 jynus: new host
[16:24:12] <wikibugs>	 (03PS6) 10Andrew Bogott: keystone hooks: create new default domains for new projects [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174)
[16:24:18] <mutante>	 just got the role the other day
[16:24:21] <jynus>	 cool, then no issue- only a warning
[16:24:25] <moritzm>	 jynus: yeah, those are replacing install*
[16:24:30] <mutante>	 cool, it does have backup::host class
[16:24:40] <jynus>	 let me know if interested to do a manual run
[16:24:46] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[16:24:48] <jynus>	 when it has something meaningful to test it
[16:24:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:59] <mutante>	 the data was rsynced from install1002 .. so we already have a backup of that
[16:24:59] <jynus>	 otherwise it will be done automatically at the beginning of the month
[16:25:06] <mutante>	 i think March 1st is enough
[16:25:41] <jynus>	 cool, just announcing to feel free to ask me any operations in the future
[16:26:10] <jynus>	 especially when moving hosts, it is super-easy to do a custom run
[16:26:24] <jynus>	 it is literally just executing "run" :-D
[16:26:26] <wikibugs>	 10Operations: Log the real X-Client-IP - https://phabricator.wikimedia.org/T246348 (10Reedy)
[16:27:10] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[16:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:32] <mutante>	 jynus: thank you, sounds good
[16:29:31] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2007.codfw.wmnet'] `  and were **ALL** successful.
[16:31:55] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['parse2008.codfw.wmnet'] `  and were **ALL** successful.
[16:34:47] <wikibugs>	 (03PS10) 10Bstorm: labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832
[16:36:42] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[16:36:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:09] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832 (owner: 10Bstorm)
[16:39:02] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:42] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_cxserver_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:40:46] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:43:24] <icinga-wm>	 PROBLEM - Host lvs4005 is DOWN: PING CRITICAL - Packet loss = 100%
[16:45:22] <icinga-wm>	 RECOVERY - Host lvs4005 is UP: PING OK - Packet loss = 0%, RTA = 74.65 ms
[16:45:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Good enough is good enough™" [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli)
[16:46:45] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs4005.ulsfo.wmnet'] `  and were **ALL** successful.
[16:48:04] <wikibugs>	 (03PS1) 10Vgutierrez: lvs: Re-enable BGP in lvs4005 [puppet] - 10https://gerrit.wikimedia.org/r/575295 (https://phabricator.wikimedia.org/T245984)
[16:49:10] <addshore>	 !log END warming wikidata term cache on db1126 for Q4-6 million T219123 (pass1)
[16:49:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:15] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[16:49:22] <addshore>	 !log START warming wikidata term cache on db1126 for Q4-6 million T219123 (pass2)
[16:49:24] <wikibugs>	 (03PS2) 10SBassett: Deployment group audit [puppet] - 10https://gerrit.wikimedia.org/r/574869 (https://phabricator.wikimedia.org/T237696)
[16:49:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:42] <volans>	 !log temporarily decommented external check for icinga2001. Restarting Icinga on icinga2001
[16:50:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:02] <wikibugs>	 (03PS1) 10Bstorm: labstore: finish setting up the firewall on the old primary cluster [puppet] - 10https://gerrit.wikimedia.org/r/575296 (https://phabricator.wikimedia.org/T165136)
[16:52:28] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] lvs: Re-enable BGP in lvs4005 [puppet] - 10https://gerrit.wikimedia.org/r/575295 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez)
[16:53:25] <wikibugs>	 (03PS6) 10Krinkle: Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz)
[16:55:02] <vgutierrez>	 !log re-enable BGP in lvs4005 - T245984
[16:55:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:08] <stashbot>	 T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984
[16:55:10] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz)
[16:56:02] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] labstore: finish setting up the firewall on the old primary cluster [puppet] - 10https://gerrit.wikimedia.org/r/575296 (https://phabricator.wikimedia.org/T165136) (owner: 10Bstorm)
[16:56:14] <wikibugs>	 (03Merged) 10jenkins-bot: Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz)
[16:56:47] <wikibugs>	 (03PS3) 10SBassett: Deployment group audit [puppet] - 10https://gerrit.wikimedia.org/r/574869 (https://phabricator.wikimedia.org/T237696)
[16:57:44] <icinga-wm>	 PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: First Paint desktop, First Paint mobile, INM Satisfaction Ratio, Load Event End overall, Response Start desktop, Response Start mobile, Varnish frontend hit rate. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[16:58:38] <godog>	 the failing grafana checks are known, patch incoming
[16:59:48] <bd808>	 !log Disabled new account creation on wikitech via horrible TitleBlacklist hack.
[17:00:04] <jouncebot>	 godog and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1700).
[17:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[17:03:31] <vgutierrez>	 !log reimage lvs5003 with buster - T245984
[17:03:42] <godog>	 oh no wikibugs is gone :(
[17:04:08] <vgutierrez>	 hmm stashbot as well?
[17:04:21] <vgutierrez>	 yup :_(
[17:04:38] <godog>	 true :(
[17:04:56] <vgutierrez>	 now nobody cares about what I write here
[17:04:58] <vgutierrez>	 :_(
[17:05:26] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/mc.php: I119aff6312463 - allow_tcp_nagle_delay:off (duration: 01m 05s)
[17:05:34] <icinga-wm>	 PROBLEM - Host lvs5003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:05:45] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1087 at 10% T232446', diff saved to https://phabricator.wikimedia.org/P10546 and previous config saved to /var/cache/conftool/dbconfig/20200227-170543-jynus.json
[17:06:36] <icinga-wm>	 RECOVERY - Host lvs5003 is UP: PING OK - Packet loss = 0%, RTA = 231.33 ms
[17:06:41] <vgutierrez>	 uh...
[17:06:48] <vgutierrez>	 lvs5003 should be downtimed by the reimage script
[17:07:48] <vgutierrez>	 doing it manually...
[17:08:47] <Lucas_WMDE>	 someone should probably re!log those logmsgbot things once stashbot is back?
[17:09:24] <Lucas_WMDE>	 (I’m leaving soon so I probably can’t do it myself)
[17:11:01] <godog>	 grafana alerts should be recovering soon
[17:11:46] <addshore>	 !log END warming wikidata term cache on db1126 for Q4-6 million T219123 (pass2)
[17:11:54] <addshore>	 jouncebot: now
[17:11:54] <jouncebot>	 For the next 0 hour(s) and 48 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1700)
[17:14:26] <addshore>	 are we having a netsplit or something that I missed? or are all the bots just dead?
[17:14:42] <jynus>	 wikitech technical issues
[17:15:19] <addshore>	 hmm, wikibugs is also gone, I'm guessing that is related?
[17:15:28] <godog>	 it is yeah
[17:15:42] * addshore was just about to move from Q4 million to Q6 million for wikidata item term reads on clients
[17:16:03] <jynus>	 I think we can go on
[17:16:05] <addshore>	 :)
[17:16:19] * addshore announces doing a thing in mediawiki-config
[17:16:20] <jynus>	 we haven't lost observability
[17:16:39] <addshore>	 for reference my thingy is https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/575301/
[17:17:38] <wikibugs>	 (03Merged) 10jenkins-bot: Read from the new term store up to Q6 million for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575301 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore)
[17:18:07] <addshore>	 :)
[17:18:27] <addshore>	 !log (relog FROM 5:11) END warming wikidata term cache on db1126 for Q4-6 million T219123 (pass2)
[17:18:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:33] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[17:18:43] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10Papaul) |Servers|NIC1|NIC2|NIC3|NIC4|Note| |lvs2007| |lvs2008|asw-b2  xe-2/0/45|'A7': xe-7/0/45|C2': xe-2/0/45|D2': xe-2/0/46| using the same cables lvs2006 was using...
[17:19:04] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q6M for the new term store for clients (was Q4M) + warm db1126 caches (T219123) (duration: 01m 04s)
[17:19:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:10] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 (10Vgutierrez) 05Stalled→03Resolved a:03Vgutierrez
[17:19:12] <wikibugs>	 10Operations, 10Acme-chief, 10Traffic: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (10Vgutierrez)
[17:19:43] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10wiki_willy) @bd808 - thanks for providing the background context around these.  I hit up Rob to prioritize T243471 more. (quotes being submitted soon)  Also, w...
[17:20:20] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q6M for the new term store for clients (was Q4M) + warm db1126 caches (T219123) cache bust (duration: 01m 04s)
[17:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:51] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] wikiworkshop.org: switch DNS to our text endpoint [dns] - 10https://gerrit.wikimedia.org/r/575304 (https://phabricator.wikimedia.org/T242374) (owner: 10BBlack)
[17:21:00] <wikibugs>	 (03PS1) 10Vgutierrez: ATS: Switch unified cert vendor to Let's Encrypt on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/575305 (https://phabricator.wikimedia.org/T230687)
[17:24:13] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on labstore1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:24:22] <wikibugs>	 (03CR) 10Herron: [C: 03+2] "LGTM -- aiui the ruby clientip shuffling is expected to be temporary until the apache logs are more consistently formatted" [puppet] - 10https://gerrit.wikimedia.org/r/575000 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli)
[17:24:27] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10wiki_willy)
[17:24:58] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10wiki_willy) T246365 created for ordering the replacement drive.   Thanks, Willy
[17:25:45] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on labstore1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:26:30] <wikibugs>	 (03CR) 10Vgutierrez: "pcc looks sane: https://puppet-compiler.wmflabs.org/compiler1002/21130/" [puppet] - 10https://gerrit.wikimedia.org/r/575305 (https://phabricator.wikimedia.org/T230687) (owner: 10Vgutierrez)
[17:27:19] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-2] "merge on Monday :)" [puppet] - 10https://gerrit.wikimedia.org/r/575305 (https://phabricator.wikimedia.org/T230687) (owner: 10Vgutierrez)
[17:29:37] <wikibugs>	 (03PS1) 10Bstorm: labstore: one more nfs ferm fix for the primary cluster [puppet] - 10https://gerrit.wikimedia.org/r/575307
[17:30:17] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1087 at 20% T232446', diff saved to https://phabricator.wikimedia.org/P10547 and previous config saved to /var/cache/conftool/dbconfig/20200227-173017-jynus.json
[17:30:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:23] <stashbot>	 T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446
[17:30:43] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[17:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:25] <vgutierrez>	 !log (from 17:03) reimage lvs5003 with buster - T245984
[17:31:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:30] <stashbot>	 T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984
[17:31:34] <addshore>	 !log START warming wikidata term cache on db1126 for Q6-8 million T219123 (pass1)
[17:31:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:39] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[17:31:42] <vgutierrez>	 addshore: 05:11 != 17:11 ;P
[17:31:53] <addshore>	 vgutierrez: I realized that once I sent it >.>
[17:32:14] <jynus>	 addshore: so happy so far?
[17:32:25] <addshore>	 jynus: 6 million seems to be behaving
[17:33:00] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:33:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:18] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul)
[17:33:27] <addshore>	 jynus: the one thing I still see that makes me think it might be a contributing factor is the sending data state of some processes https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-6h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&refresh=30s&fullscreen&panelId=37
[17:33:52] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul) 05Open→03Resolved @Dzahn @joe all 20 servers ready for service
[17:34:02] <jynus>	 "sending data" is a bit meaningless, like idle
[17:34:08] <jynus>	 it means "it is doing something"
[17:34:10] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] labstore: one more nfs ferm fix for the primary cluster [puppet] - 10https://gerrit.wikimedia.org/r/575307 (owner: 10Bstorm)
[17:34:13] <addshore>	 jynus: okay :P
[17:34:28] <jynus>	 will correlate with spikes
[17:34:45] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime
[17:34:46] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:52] <jynus>	 that graph is only useful for the total
[17:35:02] <jynus>	 and for waiting/altering/updating
[17:37:11] <wikibugs>	 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) codfw: rack/setup/install 3 new payments server for frack - https://phabricator.wikimedia.org/T244169 (10Papaul) a:05Papaul→03Jgreen @Jgreen All yours.
[17:37:56] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5003 is CRITICAL: CRITICAL: 0 connections established with conf2003.codfw.wmnet:2379 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[17:38:10] <icinga-wm>	 RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[17:39:50] <vgutierrez>	 sigh... damn icinga
[17:40:15] <wikibugs>	 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Ferm rules for labstore1004/1005 NFS hosts - https://phabricator.wikimedia.org/T165136 (10Bstorm) 05Open→03Resolved a:03Bstorm The cluster runs ferm rules now.
[17:40:23] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[17:40:25] <wikibugs>	 (03PS1) 10Andrew Bogott: puppetmasters:  remove the install-console script [puppet] - 10https://gerrit.wikimedia.org/r/575309
[17:40:28] <icinga-wm>	 PROBLEM - Host lvs5003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:40:48] <vgutierrez>	 ^^ that host is being reimaged and theoretically is downtimed :/
[17:40:54] <effie>	 !log stop and mask all nginx on thumbor* 
[17:40:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:00] <effie>	 !log enable puppet on thumbor*
[17:41:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:19] <wikibugs>	 10Operations, 10Icinga, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Volans) 05Open→03Resolved a:03Volans Resolving as this is an old task and that issue has been fixed, despite we've a similar one right now.
[17:41:46] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[17:43:22] <icinga-wm>	 RECOVERY - Host lvs5003 is UP: PING OK - Packet loss = 0%, RTA = 231.37 ms
[17:44:08] <wikibugs>	 (03PS1) 10Andrew Bogott: Add cloudvirt-wdqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/575312 (https://phabricator.wikimedia.org/T221631)
[17:45:43] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs5003.eqsin.wmnet'] `  and were **ALL** successful.
[17:46:10] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add cloudvirt-wdqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/575312 (https://phabricator.wikimedia.org/T221631) (owner: 10Andrew Bogott)
[17:47:19] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez)
[17:48:45] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5003 is OK: OK: 16 connections established with conf2003.codfw.wmnet:2379 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[17:49:31] <ebernhardson>	 !log delete commonswiki_file_1582685980 from cloudelastic-chi, reindex failed and commonswiki_file_first is still primary
[17:49:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:03] <addshore>	 !log resume item migration script at Q50 million T219123 (batch size of 100, 1s sleep)
[17:52:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:09] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[17:52:14] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul)
[17:54:03] <wikibugs>	 (03PS1) 10Andrew Bogott: add host hiera info for cloudvirt-wdqs100[123] [puppet] - 10https://gerrit.wikimedia.org/r/575315
[17:56:27] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] add host hiera info for cloudvirt-wdqs100[123] [puppet] - 10https://gerrit.wikimedia.org/r/575315 (owner: 10Andrew Bogott)
[17:57:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Redis is defined in the defaults and to keep it consistent with the other KVs, I overrode only the non-default items. Still add to the i" [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust)
[17:59:34] <wikibugs>	 (03PS8) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854)
[18:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1800).
[18:00:16] <wikibugs>	 10Operations, 10ops-eqiad: db1098 power redundancy lost - https://phabricator.wikimedia.org/T246323 (10jcrespo) Please ping me if it is not something as obvious as a cable and need it down to prepare the host.
[18:03:19] <wikibugs>	 (03PS1) 10Dzahn: fix IP address for apt2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/575318 (https://phabricator.wikimedia.org/T224576)
[18:06:25] <wikibugs>	 (03PS2) 10Dzahn: fix IP address for apt2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/575318 (https://phabricator.wikimedia.org/T224576)
[18:07:43] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:55] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:09:42] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1003 is UP: PING WARNING - Packet loss = 37%, RTA = 0.26 ms
[18:09:52] <icinga-wm>	 PROBLEM - configured eth on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[18:12:59] <wikibugs>	 (03PS1) 10Herron: add profile::idp::client::httpd hiera for elk7 env [puppet] - 10https://gerrit.wikimedia.org/r/575320 (https://phabricator.wikimedia.org/T234854)
[18:13:02] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[18:13:30] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:13:34] <icinga-wm>	 PROBLEM - Check systemd state on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:14:07] <wikibugs>	 (03PS9) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854)
[18:14:38] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1001 is DOWN: PING CRITICAL - Packet loss = 100%
[18:15:12] <icinga-wm>	 PROBLEM - DPKG on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[18:15:44] <icinga-wm>	 PROBLEM - DPKG on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[18:15:46] <icinga-wm>	 PROBLEM - dhclient process on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[18:15:57] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] fix IP address for apt2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/575318 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn)
[18:18:45] <wikibugs>	 10Operations, 10ops-eqiad: db1098 power redundancy lost - https://phabricator.wikimedia.org/T246323 (10Jclark-ctr) 05Open→03Resolved @jcrespo  Reseated power cable Psu powered on  closing ticket
[18:19:46] <wikibugs>	 (03PS5) 10Dzahn: site: add new parsoid nodes with spare role [puppet] - 10https://gerrit.wikimedia.org/r/575100 (https://phabricator.wikimedia.org/T243112)
[18:20:02] <icinga-wm>	 PROBLEM - configured eth on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[18:20:13] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10RobH)
[18:20:16] <elukey>	 !log upload prometheus-mcrouter-exporter 0.1.0+git20200227-1 to stretch-wikimedia
[18:20:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:21] <elukey>	 correcting a bug --^
[18:20:42] <icinga-wm>	 PROBLEM - Disk space on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt-wdqs1003&var-datasource=eqiad+prometheus/ops
[18:21:18] <wikibugs>	 (03PS1) 10Bstorm: toolforge-kubernetes: shut down the old maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/575322 (https://phabricator.wikimedia.org/T214513)
[18:21:29] <addshore>	 !log END warming wikidata term cache on db1126 for Q6-8 million T219123 (pass1) (will do 2 more passes tomorrow)
[18:21:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:34] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[18:22:30] <icinga-wm>	 PROBLEM - dhclient process on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[18:22:48] <icinga-wm>	 RECOVERY - IPMI Sensor Status on db1098 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:25:12] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[18:25:37] <mutante>	 could we have some downtimes for these?
[18:26:19] <volans>	 mutante: icinga is also having problems with the external command, FYI, I'm troubleshooting
[18:27:20] <mutante>	 volans: oh..the passive checks from FR?  thank you! *nod*
[18:27:46] <mutante>	 that almost sounded like firewall change
[18:27:59] <mutante>	 if restarting nsca did not fix it ..yet they are sending packets as normal
[18:28:38] <volans>	 yep
[18:28:42] <volans>	 it's the command file
[18:28:48] <volans>	 some pass some not
[18:28:52] <volans>	 same for downtimes
[18:28:52] <mutante>	 oh
[18:29:18] <wikibugs>	 (03PS1) 10Bstorm: toolforge: remove the ancient version of kubectl [puppet] - 10https://gerrit.wikimedia.org/r/575325 (https://phabricator.wikimedia.org/T214513)
[18:29:27] <volans>	 both 1001 and 2001, but 2001 stopped 25m ago
[18:29:45] <mutante>	 both.. that's weird
[18:30:30] <volans>	 I'll go with a full restart; it didn't solve the issue before on 2001, but the last restart did
[18:31:03] <mutante>	 i was about to suggest that.. i had vague memories of a similar thing and that fixed it ..yea
[18:31:15] <volans>	 !log restarting icinga on icinga1001, command file randomly discarding commands
[18:31:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:32] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[18:33:44] <icinga-wm>	 PROBLEM - MegaRAID on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:33:51] <volans>	 have to wait now
[18:34:03] <mutante>	 ok
[18:34:05] <wikibugs>	 (03CR) 10Herron: "puppet is currently broken on the elk7 collectors because this hiera is missing, so no diff is displayed, but it LGTM https://puppet-compi" [puppet] - 10https://gerrit.wikimedia.org/r/575320 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[18:34:58] <icinga-wm>	 PROBLEM - Check systemd state on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:35:19] <volans>	 mutante: the downtime for now worked
[18:35:23] <volans>	 so promising
[18:35:42] <volans>	 but need to check logs for awol
[18:35:44] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:35:44] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:37:03] <volans>	 so far recovering, but I'm not happy as we don't have a real root cause, I had tried also the debug log a bit
[18:37:06] <icinga-wm>	 PROBLEM - Disk space on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt-wdqs1002&var-datasource=eqiad+prometheus/ops
[18:37:31] <wikibugs>	 (03PS2) 10Bstorm: toolforge-kubernetes: shut down the old maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/575322 (https://phabricator.wikimedia.org/T214513)
[18:37:32] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:37:54] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@357ff5c]: Refinery using 0.0.115
[18:37:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:57] <volans>	 gehel: what's up with all those cloudvirt-wdqs?
[18:38:04] <icinga-wm>	 PROBLEM - Long running screen/tmux on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[18:38:40] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:39:17] <mutante>	 volans: cool, thanks. yea, sucks to not have a root cause but as long as it happens just every few months i guess we can deal with it
[18:39:18] <icinga-wm>	 PROBLEM - configured eth on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[18:39:38] <mutante>	 sounds like last time indeed
[18:39:57] <volans>	 depends which last time
[18:40:05] <volans>	 because one of the last times nsca was the issue, and we fixed that
[18:40:25] <mutante>	 yea.. that was different. that's when we had to kill all the nsca processes afair
[18:40:57] <mutante>	 the one where icinga dropped some commands from the cmdfile ..like now
[18:41:03] <mutante>	 or did not notice them 
[18:41:14] <mutante>	 and restarting icinga itself fixed it
[18:41:42] <icinga-wm>	 PROBLEM - dhclient process on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[18:42:17] <wikibugs>	 (03PS4) 10Matěj Suchánek: Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573969
[18:43:33] <mutante>	 !log adding parse2* machines to puppet
[18:43:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:20] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[18:48:05] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@357ff5c]: Refinery using 0.0.115 (duration: 10m 11s)
[18:48:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:14] <wikibugs>	 (03CR) 10Herron: [C: 03+2] add profile::idp::client::httpd hiera for elk7 env [puppet] - 10https://gerrit.wikimedia.org/r/575320 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[18:49:22] <wikibugs>	 (03PS10) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854)
[18:49:40] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053 (10greg) Approved for all 3 from my end.
[18:49:41] <wikibugs>	 (03CR) 10Greg Grossmeier: [C: 03+1] "Approved." [puppet] - 10https://gerrit.wikimedia.org/r/575101 (https://phabricator.wikimedia.org/T246053) (owner: 10Dzahn)
[18:49:52] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[18:51:00] <elukey>	 !log upgrade prometheus-mcrouter-exporter to 0.1.0+git20200227-1 on hosts
[18:51:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:30] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1001 is DOWN: PING CRITICAL - Packet loss = 100%
[18:54:12] <wikibugs>	 (03CR) 10Herron: "some comments inline and updated pcc https://puppet-compiler.wmflabs.org/compiler1003/21137/" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[18:55:12] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[18:55:34] <icinga-wm>	 PROBLEM - MegaRAID on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:58:25] <wikibugs>	 (03PS1) 10Dzahn: installserver: add apt2001 to fail over servers for APT repo sync [puppet] - 10https://gerrit.wikimedia.org/r/575327
[18:59:59] <wikibugs>	 (03CR) 10Bstorm: "The nature of the timer::job type requires all that mess to be in there even though this is just an ensure => absent" [puppet] - 10https://gerrit.wikimedia.org/r/575322 (https://phabricator.wikimedia.org/T214513) (owner: 10Bstorm)
[19:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T1900).
[19:00:04] <jouncebot>	 tgr: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:25] <tgr>	 o/
[19:00:46] <icinga-wm>	 PROBLEM - Long running screen/tmux on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[19:01:53] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:02:46] <icinga-wm>	 PROBLEM - MegaRAID on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:03:44] <wikibugs>	 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10chasemp) @HMarcus @MoritzMuehlenhoff Can we all agree on 6 weeks notice to SRE before going live as a control here?  If so I think that closes...
[19:04:58] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1001 is UP: PING WARNING - Packet loss = 93%, RTA = 0.28 ms
[19:05:04] * Krinkle takes mwdebug1001 for performance testing
[19:05:06] <tgr>	 I can self-SWAT
[19:05:26] * Krinkle waits for tgr 
[19:05:28] <Krinkle>	 ok :)
[19:05:47] <tgr>	 Krinkle: will it interfere? I can use 1002
[19:06:09] <Krinkle>	 tgr: a scap sync will override my local changes so yeah I'll wait
[19:06:34] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:06:36] <wikibugs>	 (03PS2) 10Gergő Tisza: Enable articletopic: search keyword in CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574634 (https://phabricator.wikimedia.org/T240559)
[19:06:41] <logmsgbot>	 !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
[19:06:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:11] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Enable articletopic: search keyword in CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574634 (https://phabricator.wikimedia.org/T240559) (owner: 10Gergő Tisza)
[19:07:13] <wikibugs>	 (03PS1) 10Effie Mouzeli: hieradata: send mw1262's apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/575329 (https://phabricator.wikimedia.org/T244472)
[19:07:23] <gehel>	 volans: I'm late, but those should be new servers in WMCS that will be dedicated to wdqs testing.
[19:08:02] <gehel>	 Atm they are just the virtualization hosts, nothing wdqs specific there yet
[19:08:42] <wikibugs>	 (03Merged) 10jenkins-bot: Enable articletopic: search keyword in CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574634 (https://phabricator.wikimedia.org/T240559) (owner: 10Gergő Tisza)
[19:10:41] <wikibugs>	 (03CR) 10Herron: [C: 03+1] hieradata: send mw1262's apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/575329 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli)
[19:12:29] <wikibugs>	 (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1002/21138/mw1262.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/575329 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli)
[19:12:33] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: send mw1262's apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/575329 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli)
[19:12:57] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@357ff5c] (thin): Refinery using 0.0.115
[19:13:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:04] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@357ff5c] (thin): Refinery using 0.0.115 (duration: 00m 07s)
[19:13:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:54] <wikibugs>	 Operations, ops-eqiad, Analytics, Analytics-Kanban, Patch-For-Review: Degraded RAID on analytics1044 - https://phabricator.wikimedia.org/T245910 (Nuria) Open→Resolved
[19:14:20] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100%
[19:14:22] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:14:24] <wikibugs>	 (PS1) Krinkle: [DNM] Test LCStoreArray on mwdebug1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/575331
[19:15:28] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[19:16:46] <wikibugs>	 (CR) Nuria: Make normalized request count available in Turnilo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/575035 (https://phabricator.wikimedia.org/T241162) (owner: Milimetric)
[19:16:48] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[19:16:49] <wikibugs>	 (CR) Elukey: [C: +2] Move all Report Updater Jobs to an-launcher1001 [puppet] - https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) (owner: Elukey)
[19:17:30] <mutante>	 !log ganeti2001 - removing VM apt2001 to re-create it after IP change
[19:17:31] <logmsgbot>	 !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
[19:17:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:42] <effie>	 !log depool mw1262
[19:17:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:52] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[19:17:57] <logmsgbot>	 !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[19:18:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:18] <wikibugs>	 (PS2) Dzahn: admins: add tchanders, dmaza and wikigit to deployers [puppet] - https://gerrit.wikimedia.org/r/575101 (https://phabricator.wikimedia.org/T246053)
[19:20:32] <logmsgbot>	 !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574634|Enable articletopic: search keyword in CirrusSearch (T240559)]] (duration: 01m 05s)
[19:20:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:37] <stashbot>	 T240559: Expose ORES drafttopic data in ElasticSearch via a custom CirrusSearch keyword - https://phabricator.wikimedia.org/T240559
[19:20:44] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100%
[19:21:40] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:21:45] <wikibugs>	 (CR) Dzahn: [C: +2] admins: add tchanders, dmaza and wikigit to deployers [puppet] - https://gerrit.wikimedia.org/r/575101 (https://phabricator.wikimedia.org/T246053) (owner: Dzahn)
[19:21:57] <logmsgbot>	 !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: once more for good measure (duration: 01m 03s)
[19:22:00] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[19:22:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:11] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, the existing file will need a manual cleanup on those 7 hosts:" [puppet] - https://gerrit.wikimedia.org/r/575309 (owner: Andrew Bogott)
[19:22:31] <tgr>	 Krinkle: all yours
[19:22:42] <Krinkle>	 tgr: thanks
[19:23:23] <wikibugs>	 (CR) Andrew Bogott: [C: +2] puppetmasters:  remove the install-console script [puppet] - https://gerrit.wikimedia.org/r/575309 (owner: Andrew Bogott)
[19:24:26] <icinga-wm>	 PROBLEM - configured eth on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[19:25:29] <wikibugs>	 (CR) Andrew Bogott: [C: +2] "cleanup done" [puppet] - https://gerrit.wikimedia.org/r/575309 (owner: Andrew Bogott)
[19:26:06] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100%
[19:26:21] <logmsgbot>	 !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
[19:26:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:50] <icinga-wm>	 PROBLEM - dhclient process on cloudvirt-wdqs1001 is CRITICAL: connect to address 10.64.20.44 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[19:27:40] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[19:28:05] <wikibugs>	 Operations, Analytics, Analytics-EventLogging, Analytics-Kanban, and 5 others: Public schema.wikimedia.org endpoint for schema.svc - https://phabricator.wikimedia.org/T233630 (Nuria) Open→Resolved
[19:28:32] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[19:29:42] <icinga-wm>	 PROBLEM - Check systemd state on cloudvirt-wdqs1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.44: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:29:56] <icinga-wm>	 PROBLEM - DPKG on cloudvirt-wdqs1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.44: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[19:30:07] <wikibugs>	 (PS11) Herron: add load balancing for kibana-next [puppet] - https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854)
[19:30:08] <icinga-wm>	 PROBLEM - Disk space on cloudvirt-wdqs1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.44: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt-wdqs1001&var-datasource=eqiad+prometheus/ops
[19:30:46] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt-wdqs1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.44: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:31:26] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100%
[19:32:26] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:33:04] <icinga-wm>	 RECOVERY - configured eth on cloudvirt-wdqs1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[19:33:18] <icinga-wm>	 RECOVERY - dhclient process on cloudvirt-wdqs1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[19:34:08] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[19:34:40] <wikibugs>	 (PS3) Dzahn: admins: add tchanders, dmaza and wikigit to deployers [puppet] - https://gerrit.wikimedia.org/r/575101 (https://phabricator.wikimedia.org/T246053)
[19:35:00] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[19:35:50] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={rsyslog-info,rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d
[19:35:50] <icinga-wm>	 consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[19:36:26] <icinga-wm>	 RECOVERY - DPKG on cloudvirt-wdqs1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[19:36:36] <icinga-wm>	 RECOVERY - Disk space on cloudvirt-wdqs1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt-wdqs1001&var-datasource=eqiad+prometheus/ops
[19:36:54] <icinga-wm>	 RECOVERY - puppet last run on cloudvirt-wdqs1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:37:14] <AaronSchulz>	 marostegui: want to deploy https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/525147/ today?
[19:38:18] <icinga-wm>	 RECOVERY - Check systemd state on cloudvirt-wdqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:38:42] <logmsgbot>	 !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
[19:39:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:11] <wikibugs>	 (PS1) Jforrester: Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055)
[19:43:32] <icinga-wm>	 PROBLEM - configured eth on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[19:43:50] <icinga-wm>	 PROBLEM - Check systemd state on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:44:04] <icinga-wm>	 PROBLEM - dhclient process on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[19:44:06] <icinga-wm>	 PROBLEM - Disk space on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt-wdqs1003&var-datasource=eqiad+prometheus/ops
[19:44:26] <icinga-wm>	 PROBLEM - DPKG on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[19:45:04] <icinga-wm>	 PROBLEM - SSH on cloudvirt-wdqs1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:45:48] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100%
[19:46:44] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on cloudvirt-wdqs1001 is OK: OK: synced at Thu 2020-02-27 19:46:43 UTC. https://wikitech.wikimedia.org/wiki/NTP
[19:46:48] <mutante>	 !log Welcome new deployers Thalia Chan, Moriel Schottlender and Dayllan Maza (Anti-Harassment-Tools team)
[19:46:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:46] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:49:05] <wikibugs>	 (Abandoned) Krinkle: [DNM] Test LCStoreArray on mwdebug1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/575331 (owner: Krinkle)
[19:49:18] <icinga-wm>	 RECOVERY - SSH on cloudvirt-wdqs1003 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:49:22] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[19:49:25] * Krinkle is done testing on mwdebug1001
[19:50:12] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[19:50:19] <wikibugs>	 Operations, Parsoid-PHP, SRE-Access-Requests, serviceops, Patch-For-Review: Give all members of the Parsing team production `deployment` access - https://phabricator.wikimedia.org/T245877 (greg) Approved from my end.
[19:52:31] <wikibugs>	 (PS1) BBlack: Revert "admin: add Brandon's temporary key" [puppet] - https://gerrit.wikimedia.org/r/575340
[19:54:08] <wikibugs>	 (PS1) BBlack: Revert "new key for bblack" [homer/public] - https://gerrit.wikimedia.org/r/575341
[19:54:12] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:54:12] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100%
[19:54:13] <wikibugs>	 (PS2) BBlack: Revert "new key for bblack" [homer/public] - https://gerrit.wikimedia.org/r/575341
[19:55:12] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[19:56:31] <wikibugs>	 (PS1) Ayounsi: Add Prometheus exporter for Squid [puppet] - https://gerrit.wikimedia.org/r/575342 (https://phabricator.wikimedia.org/T245176)
[19:56:46] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[19:57:09] <wikibugs>	 (PS1) Ottomata: eventgate-logging-external - bump image version to 2020-02-25-183224-production [deployment-charts] - https://gerrit.wikimedia.org/r/575343 (https://phabricator.wikimedia.org/T226986)
[19:58:21] <wikibugs>	 (PS1) Effie Mouzeli: logstash: switch NOSPACE to DATA on apache grok filter [puppet] - https://gerrit.wikimedia.org/r/575344
[19:59:43] <wikibugs>	 (CR) Ottomata: [C: +2] eventgate-logging-external - bump image version to 2020-02-25-183224-production [deployment-charts] - https://gerrit.wikimedia.org/r/575343 (https://phabricator.wikimedia.org/T226986) (owner: Ottomata)
[20:00:04] <jouncebot>	 longma and twentyafterfour: How many deployers does it take to do Mediawiki train - American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200227T2000).
[20:00:16] <logmsgbot>	 !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
[20:00:16] <logmsgbot>	 !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
[20:00:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:40] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:00:40] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:01:19] <wikibugs>	 (CR) BBlack: [C: +2] Revert "admin: add Brandon's temporary key" [puppet] - https://gerrit.wikimedia.org/r/575340 (owner: BBlack)
[20:02:13] <wikibugs>	 (PS1) Jeena Huneidi: all wikis to 1.35.0-wmf.21  refs T233869 [mediawiki-config] - https://gerrit.wikimedia.org/r/575347
[20:02:15] <wikibugs>	 (CR) Jeena Huneidi: [C: +2] all wikis to 1.35.0-wmf.21  refs T233869 [mediawiki-config] - https://gerrit.wikimedia.org/r/575347 (owner: Jeena Huneidi)
[20:02:22] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[20:02:22] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[20:02:30] <icinga-wm>	 PROBLEM - MegaRAID on cloudvirt-wdqs1002 is CRITICAL: connect to address 10.64.20.45 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:04:10] <wikibugs>	 (PS3) Holger Knust: Added new chart for cpjobqueue [deployment-charts] - https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399)
[20:04:24] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.35.0-wmf.21  refs T233869 [mediawiki-config] - https://gerrit.wikimedia.org/r/575347 (owner: Jeena Huneidi)
[20:05:37] <wikibugs>	 (CR) Herron: [C: +1] logstash: switch NOSPACE to DATA on apache grok filter [puppet] - https://gerrit.wikimedia.org/r/575344 (owner: Effie Mouzeli)
[20:05:57] <logmsgbot>	 !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.21  refs T233869
[20:06:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:02] <stashbot>	 T233869: 1.35.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T233869
[20:07:02] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100%
[20:07:06] <icinga-wm>	 PROBLEM - Host cloudvirt-wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:07:18] <logmsgbot>	 !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
[20:07:18] <logmsgbot>	 !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
[20:07:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:11] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] logstash: switch NOSPACE to DATA on apache grok filter [puppet] - https://gerrit.wikimedia.org/r/575344 (owner: Effie Mouzeli)
[20:09:16] <wikibugs>	 (PS4) Holger Knust: Added new chart for cpjobqueue [deployment-charts] - https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399)
[20:09:18] <wikibugs>	 Operations, SRE-Access-Requests, Patch-For-Review: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053 (Dzahn) Hey all, the code change to add your SSH users has been merged. Puppet ran on the bastion hosts and deploy1001.  Here are some docs...
[20:10:27] <wikibugs>	 Operations, SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053 (Dzahn) Open→Resolved a:Dzahn ` [deploy1001:~] $ id dmaza uid=17497(dmaza) gid=500(wikidev) groups=500(wikidev),705(deployment) [deploy1001:~] $ id w...
[20:10:43] <wikibugs>	 (PS1) Ottomata: eventgate-logging-external - fix mediawiki/client/error schema title [deployment-charts] - https://gerrit.wikimedia.org/r/575348 (https://phabricator.wikimedia.org/T226986)
[20:12:07] <wikibugs>	 (PS2) Jforrester: Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055)
[20:12:27] <wikibugs>	 (CR) Ottomata: [C: +2] eventgate-logging-external - fix mediawiki/client/error schema title [deployment-charts] - https://gerrit.wikimedia.org/r/575348 (https://phabricator.wikimedia.org/T226986) (owner: Ottomata)
[20:13:57] <James_F>	 longma: Looks quiet to me.
[20:14:07] <longma>	 agreed
[20:14:07] <effie>	 !log pool mw1262
[20:14:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:33] <wikibugs>	 (CR) BBlack: [C: +2] Revert "new key for bblack" [homer/public] - https://gerrit.wikimedia.org/r/575341 (owner: BBlack)
[20:14:50] <wikibugs>	 (Merged) jenkins-bot: Revert "new key for bblack" [homer/public] - https://gerrit.wikimedia.org/r/575341 (owner: BBlack)
[20:15:58] <wikibugs>	 (CR) Jdlrobson: "I think this can land now?" [mediawiki-config] - https://gerrit.wikimedia.org/r/572998 (owner: Jforrester)
[20:16:43] <logmsgbot>	 !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
[20:16:43] <logmsgbot>	 !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
[20:16:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:42] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[20:21:37] <logmsgbot>	 !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
[20:21:37] <logmsgbot>	 !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
[20:21:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:55] <wikibugs>	 (PS9) Jforrester: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/572998
[20:22:04] <logmsgbot>	 !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' .
[20:22:04] <logmsgbot>	 !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
[20:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:11] <wikibugs>	 (CR) Jforrester: [C: -1] "> Patch Set 8:" [mediawiki-config] - https://gerrit.wikimedia.org/r/572998 (owner: Jforrester)
[20:22:17] <wikibugs>	 (CR) Jforrester: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/572998 (owner: Jforrester)
[20:22:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:04] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt-wdqs1003 is CRITICAL: connect to address 10.64.20.46 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:24:36] <wikibugs>	 (PS1) Bstorm: cloudstore: Add cloudbackup servers to the ferm rules [puppet] - https://gerrit.wikimedia.org/r/575351
[20:24:52] <icinga-wm>	 RECOVERY - Host cloudvirt-wdqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[20:26:01] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[20:26:04] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:48] <wikibugs>	 (CR) Ayounsi: "Still have to run PCC, but this role is still WIP." [puppet] - https://gerrit.wikimedia.org/r/575342 (https://phabricator.wikimedia.org/T245176) (owner: Ayounsi)
[20:28:58] <icinga-wm>	 PROBLEM - configured eth on cloudvirt-wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[20:29:16] <icinga-wm>	 PROBLEM - Check systemd state on cloudvirt-wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:29:28] <icinga-wm>	 PROBLEM - dhclient process on cloudvirt-wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[20:29:34] <icinga-wm>	 PROBLEM - Disk space on cloudvirt-wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt-wdqs1003&var-datasource=eqiad+prometheus/ops
[20:29:52] <icinga-wm>	 PROBLEM - DPKG on cloudvirt-wdqs1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.46: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[20:30:33] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[20:30:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:13] <wikibugs>	 (PS2) Bstorm: cloudstore: Add cloudbackup servers to the ferm rules [puppet] - https://gerrit.wikimedia.org/r/575351
[20:32:32] <wikibugs>	 (CR) Andrew Bogott: [C: +1] cloudstore: Add cloudbackup servers to the ferm rules [puppet] - https://gerrit.wikimedia.org/r/575351 (owner: Bstorm)
[20:32:57] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:32:59] <James_F>	 longma: OK for me to do a deploy?
[20:33:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:20] <hauskatze>	 RoanKattouw: Hi. Would it be possible to take a look at T244617? Thanks
[20:34:20] <stashbot>	 T244617: Please clear two stuck notifications for MABot - https://phabricator.wikimedia.org/T244617
[20:34:35] <longma>	 James_F: yeah go ahead
[20:34:42] <wikibugs>	 (CR) Bstorm: [C: +2] cloudstore: Add cloudbackup servers to the ferm rules [puppet] - https://gerrit.wikimedia.org/r/575351 (owner: Bstorm)
[20:34:50] <James_F>	 Excellent.
[20:34:56] <wikibugs>	 (CR) Jforrester: [C: +2] Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/572998 (owner: Jforrester)
[20:36:21] <wikibugs>	 (Merged) jenkins-bot: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/572998 (owner: Jforrester)
[20:50:58] <marostegui>	 AaronSchulz: I'm off today, sorry, let's try next week!
[20:53:40] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on cloudvirt-wdqs1003 is OK: OK: synced at Thu 2020-02-27 20:53:39 UTC. https://wikitech.wikimedia.org/wiki/NTP
[20:53:59] <wikibugs>	 (PS1) Jforrester: wgLogos: Explicitly set 'wordmark' for all Wikipedias which over-ride [mediawiki-config] - https://gerrit.wikimedia.org/r/575352
[20:56:58] <wikibugs>	 (PS2) Jforrester: wgLogos: Explicitly set 'wordmark' for all Wikipedias which over-ride [mediawiki-config] - https://gerrit.wikimedia.org/r/575352
[20:58:41] <wikibugs>	 (CR) Jforrester: [C: +2] wgLogos: Explicitly set 'wordmark' for all Wikipedias which over-ride [mediawiki-config] - https://gerrit.wikimedia.org/r/575352 (owner: Jforrester)
[20:59:42] <wikibugs>	 (Merged) jenkins-bot: wgLogos: Explicitly set 'wordmark' for all Wikipedias which over-ride [mediawiki-config] - https://gerrit.wikimedia.org/r/575352 (owner: Jforrester)
[21:02:46] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Merge wgMinervaCustomLogos into wgLogos (duration: 00m 57s)
[21:02:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:09] <wikibugs>	 Operations, MediaWiki-General, observability, serviceops: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (colewhite) One alternative is to adopt a sidecar in the form of statsd_exporter and have it do the heavy lifting of translating MediaWiki and MW Extension metrics in...
[21:04:09] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 00m 56s)
[21:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:21] <logmsgbot>	 !log jforrester@deploy1001 Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org)
[21:07:28] <James_F>	 Oh dear.
[21:07:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:16] <wikibugs>	 (PS1) Volans: netbox: fine tune log and exception messages [software/spicerack] - https://gerrit.wikimedia.org/r/575353
[21:10:00] * James_F pokes.
[21:10:26] <icinga-wm>	 PROBLEM - Apache HTTP on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:10:26] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:10:26] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:10:26] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:10:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:10:40] <icinga-wm>	 PROBLEM - Apache HTTP on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:10:42] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:10:45] <James_F>	 Yeah, sorry, this is me. Fixing now.
[21:10:46] <icinga-wm>	 PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1985 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:10:46] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:10:52] <James_F>	 Out-of-sequence deploy.
[21:10:54] <icinga-wm>	 PROBLEM - Apache HTTP on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:10:56] <logmsgbot>	 !log jforrester@deploy1001 Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org)
[21:11:04] <icinga-wm>	 PROBLEM - Apache HTTP on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:11:04] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:11:04] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:11:04] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:11:04] <icinga-wm>	 PROBLEM - Apache HTTP on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:11:04] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:11:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:20] <icinga-wm>	 PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:11:20] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:11:22] <James_F>	 (Canaries are the ones upset.)
[21:11:24] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:11:26] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1985 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:11:41] <logmsgbot>	 !log jforrester@deploy1001 Synchronized multiversion/MWWikiversions.php: Drop references to four dblists (duration: 00m 35s)
[21:11:42] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:11:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:44] <icinga-wm>	 PROBLEM - Apache HTTP on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:11:54] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:11:57] <logmsgbot>	 !log jforrester@deploy1001 sync-file aborted: Drop references to four dblists (duration: 00m 05s)
[21:12:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:08] <icinga-wm>	 PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1985 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:12:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:12:08] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:12:24] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:12:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:12:46] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:12:48] <icinga-wm>	 RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 631 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:12:50] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.212 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:12:58] <icinga-wm>	 RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:13:05] <James_F>	 Sorry for the noise.
[21:13:08] <icinga-wm>	 RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:13:08] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:13:08] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:13:08] <icinga-wm>	 RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:13:08] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:13:08] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:13:21] <logmsgbot>	 !log jforrester@deploy1001 Synchronized dblists/: Add back the deleted dblists to make the canaries quiet (duration: 00m 56s)
[21:13:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:24] <icinga-wm>	 RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:13:24] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:13:28] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:13:30] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 632 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:13:38] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@c70b338]: AQS: Minor fix
[21:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:48] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:13:50] <icinga-wm>	 RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:14:00] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 73003 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:14:16] <icinga-wm>	 RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 73014 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:14:16] <icinga-wm>	 RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:14:16] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:14:32] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:14:38] <icinga-wm>	 RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:14:38] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:14:38] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:14:38] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:14:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:14:45] <logmsgbot>	 !log jforrester@deploy1001 Synchronized multiversion/MWWikiversions.php: Drop references to four dblists to canaries too (duration: 00m 55s)
[21:14:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:08] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@c70b338]: AQS: Minor fix (duration: 02m 30s)
[21:16:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:08] <logmsgbot>	 !log jforrester@deploy1001 Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org)
[21:19:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:51] <logmsgbot>	 !log jforrester@deploy1001 Scap failed!: 10/11 canaries failed their endpoint checks(http://en.wikipedia.org)
[21:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:20] <icinga-wm>	 PROBLEM - Apache HTTP on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:22:20] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:22:26] <icinga-wm>	 PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1985 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:22:28] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:22:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:22:39] <logmsgbot>	 !log jforrester@deploy1001 Scap failed!: 8/11 canaries failed their endpoint checks(http://en.wikipedia.org)
[21:22:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:44] <icinga-wm>	 PROBLEM - Apache HTTP on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:22:44] <icinga-wm>	 PROBLEM - Apache HTTP on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:22:44] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:22:44] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:22:44] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:22:44] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1980 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:22:45] <James_F>	 Eurgh.
[21:23:41] <James_F>	 I hate scap with the passion of a thousand suns.
[21:24:12] <logmsgbot>	 !log jforrester@deploy1001 Synchronized dblists/: Again, this time without blanked files (duration: 00m 56s)
[21:24:12] <stashbot>	 jforrester@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[21:24:14] <wikibugs>	 (03PS2) 10Aaron Schulz: Set "coalesceKeys" in mc.php to minimize host fan-out by WANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575098
[21:24:32] <icinga-wm>	 RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:24:32] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:24:36] <icinga-wm>	 RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 631 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:24:38] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:24:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:24:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:24:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:24:54] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:24:54] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:24:54] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:24:54] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 73004 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:24:54] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:25:21] <logmsgbot>	 !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Touch the dblists list (duration: 00m 56s)
[21:25:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:08] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:29:48] <wikibugs>	 10Operations, 10netbox: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10crusnov) Some notes from conversations about this:  - https://gerrit.wikimedia.org/r/c/operations/puppet/+/571486 is an example of CAS setup .   - We are in general agreement as to using apache to query CAS and the...
[21:30:42] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f4ae4e0d390: Failed to establish a new connection: [Errno 111] Connection
[21:30:42] <icinga-wm>	 ://wikitech.wikimedia.org/wiki/Search%23Administration
[21:31:38] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:31:46] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:31:48] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:32:58] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:33:28] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:35:04] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: relocating_shards: 0, unassigned_shards: 374, number_of_nodes: 6, cluster_name: production-logstash-eqiad, number_of_data_nodes: 3, active_shards: 750, task_max_waiting_in_queue_millis: 59399, active_shards_percent_as_number: 66.72597864768683, initializing_shards: 0, timed_out: False, status: yello
[21:35:04] <icinga-wm>	 light_fetch: 1122, number_of_pending_tasks: 52, delayed_unassigned_shards: 0, active_primary_shards: 484 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:35:06] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: initializing_shards: 0, timed_out: False, task_max_waiting_in_queue_millis: 56570, number_of_in_flight_fetch: 1122, active_primary_shards: 484, active_shards: 750, active_shards_percent_as_number: 66.72597864768683, delayed_unassigned_shards: 0, number_of_nodes: 6, cluster_name: production-logstash-
[21:35:06] <icinga-wm>	 pending_tasks: 36, number_of_data_nodes: 3, unassigned_shards: 374, status: yellow, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:35:17] <wikibugs>	 (03PS1) 10Jforrester: Revert "Merge wgMinervaCustomLogos into wgLogos" and follow-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575356
[21:35:34] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_nodes: 6, initializing_shards: 0, number_of_pending_tasks: 36, timed_out: False, number_of_data_nodes: 3, active_primary_shards: 484, active_shards: 750, status: yellow, cluster_name: production-logstash-eqiad, relocating_shards: 0, number_of_in_flight_fetch: 1122, active_shards_percent_as
[21:35:34] <icinga-wm>	 864768683, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 85255, unassigned_shards: 374 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:35:54] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: timed_out: False, initializing_shards: 0, relocating_shards: 0, unassigned_shards: 374, active_shards_percent_as_number: 66.72597864768683, cluster_name: production-logstash-eqiad, task_max_waiting_in_queue_millis: 105976, active_primary_shards: 484, delayed_unassigned_shards: 0, status: yellow, num
[21:35:54] <icinga-wm>	 number_of_pending_tasks: 33, number_of_data_nodes: 3, number_of_in_flight_fetch: 1122, active_shards: 750 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:36:00] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: task_max_waiting_in_queue_millis: 112780, number_of_data_nodes: 3, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 1122, number_of_pending_tasks: 33, status: yellow, active_primary_shards: 484, active_shards_percent_as_number: 66.72597864768683, initializing_shards: 0, timed_out: False, rel
[21:36:01] <icinga-wm>	 , active_shards: 750, cluster_name: production-logstash-eqiad, unassigned_shards: 374, number_of_nodes: 6 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:37:06] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] "Suspect flakiness." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575356 (owner: 10Jforrester)
[21:38:13] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Merge wgMinervaCustomLogos into wgLogos" and follow-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575356 (owner: 10Jforrester)
[21:39:14] <logmsgbot>	 !log jforrester@deploy1001 scap failed: average error rate on 11/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
[21:39:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:29] <James_F>	 Forcing.
[21:40:07] <logmsgbot>	 !log jforrester@deploy1001 Synchronized dblists/: Re-establish dblists everywhere (duration: 00m 33s)
[21:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:40] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 1051 threshold =0.34 breach: task_max_waiting_in_queue_millis: 81936, delayed_unassigned_shards: 0, number_of_nodes: 5, active_shards_percent_as_number: 6.494661921708185, timed_out: False, status: red, number_of_in_flight_fetch: 0, unassigned_shards: 1043, number_of_pending_tasks: 76, initializing_shards:
[21:41:40] <icinga-wm>	 a_nodes: 2, active_primary_shards: 65, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 73 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:41:40] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 1051 threshold =0.34 breach: status: red, cluster_name: production-logstash-eqiad, unassigned_shards: 1043, relocating_shards: 0, task_max_waiting_in_queue_millis: 83481, active_shards: 73, number_of_data_nodes: 2, number_of_in_flight_fetch: 0, active_shards_percent_as_number: 6.494661921708185, number_of_
[21:41:40] <icinga-wm>	 of_pending_tasks: 78, active_primary_shards: 65, initializing_shards: 8, delayed_unassigned_shards: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:42:06] <logmsgbot>	 !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Use the four dblists again (duration: 00m 33s)
[21:42:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:10] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 994 threshold =0.34 breach: number_of_nodes: 5, status: red, number_of_pending_tasks: 78, initializing_shards: 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 112472, delayed_unassigned_shards: 0, active_shards: 130, unassigned_shards: 986, relocating_shards: 0, number_of_data_nodes: 2, 
[21:42:10] <icinga-wm>	 cent_as_number: 11.565836298932384, timed_out: False, cluster_name: production-logstash-eqiad, active_primary_shards: 117 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:42:32] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 958 threshold =0.34 breach: active_shards: 166, cluster_name: production-logstash-eqiad, number_of_in_flight_fetch: 0, initializing_shards: 8, relocating_shards: 0, unassigned_shards: 950, number_of_data_nodes: 2, number_of_nodes: 5, active_primary_shards: 153, timed_out: False, task_max_waiting_in_queue_m
[21:42:32] <icinga-wm>	 tive_shards_percent_as_number: 14.768683274021353, status: red, delayed_unassigned_shards: 0, number_of_pending_tasks: 75 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:42:38] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 947 threshold =0.34 breach: unassigned_shards: 939, timed_out: False, active_shards_percent_as_number: 15.747330960854091, task_max_waiting_in_queue_millis: 141613, relocating_shards: 0, delayed_unassigned_shards: 0, cluster_name: production-logstash-eqiad, status: red, number_of_pending_tasks: 111, number
[21:42:38] <icinga-wm>	 ive_primary_shards: 164, number_of_in_flight_fetch: 0, active_shards: 177, initializing_shards: 8, number_of_data_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:43:45] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Roll back to setting wgMinervaCustomLogos (duration: 00m 33s)
[21:43:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:03] <wikibugs>	 (03CR) 10C. Scott Ananian: "Largely similar to I892ece88dd56af2758712b0960a62be7a4370715 but LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) (owner: 10Jforrester)
[21:44:36] <James_F>	 OK, prod clean and I'm stopping for a bit.
[21:44:43] * James_F sighs at ES.
[21:46:10] <wikibugs>	 (03CR) 10C. Scott Ananian: [C: 03+1] Parsoid: Use the version of Parsoid in $IP/vendor (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) (owner: 10Jforrester)
[21:46:16] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[21:46:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1007.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:46:24] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[21:46:34] <wikibugs>	 (03Abandoned) 10C. Scott Ananian: Load Parsoid from the vendor repo, not from an ad-hoc deploy dir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572051 (owner: 10C. Scott Ananian)
[21:46:38] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[21:46:54] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[21:47:04] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[21:47:12] <icinga-wm>	 PROBLEM - logstash process on logstash2005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (logstash), command name java, args logstash https://wikitech.wikimedia.org/wiki/Logstash
[21:47:58] <icinga-wm>	 PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:47:58] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet, logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1007.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:48:04] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[21:48:04] <icinga-wm>	 PROBLEM - logstash process on logstash1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (logstash), command name java, args logstash https://wikitech.wikimedia.org/wiki/Logstash
[21:48:08] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[21:48:08] <icinga-wm>	 PROBLEM - logstash process on logstash2006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (logstash), command name java, args logstash https://wikitech.wikimedia.org/wiki/Logstash
[21:48:14] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_nodes: 5, number_of_pending_tasks: 0, active_shards: 744, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, timed_out: False, active_primary_shards: 484, number_of_data_nodes: 2, unassigned_shards: 372, cluster_name: production-logstash-eqiad, number_of_in_flight_fetch: 0, initial
[21:48:14] <icinga-wm>	 active_shards_percent_as_number: 66.19217081850533, delayed_unassigned_shards: 0, status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:48:16] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 5, initializing_shards: 8, task_max_waiting_in_queue_millis: 0, number_of_data_nodes: 2, active_primary_shards: 484, relocating_shards: 0, cluster_name: production-logstash-eqiad, number_of_pending_tasks: 0, timed_out: False, active_shards: 744, active_shards_percent
[21:48:16] <icinga-wm>	 217081850533, number_of_in_flight_fetch: 0, delayed_unassigned_shards: 0, unassigned_shards: 372 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:48:36] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[21:48:44] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards: 746, number_of_nodes: 5, relocating_shards: 0, active_primary_shards: 484, cluster_name: production-logstash-eqiad, number_of_data_nodes: 2, number_of_pending_tasks: 0, timed_out: False, delayed_unassigned_shards: 0, u
[21:48:44] <icinga-wm>	  370, active_shards_percent_as_number: 66.37010676156584, initializing_shards: 8, status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:48:50] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@c70b338]: AQS: Minor fix take 2
[21:48:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:06] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: relocating_shards: 0, status: yellow, initializing_shards: 8, number_of_data_nodes: 2, active_primary_shards: 484, task_max_waiting_in_queue_millis: 0, active_shards: 747, timed_out: False, number_of_nodes: 5, cluster_name: production-logstash-eqiad, unassigned_shards: 369, active_shards_percent_as_
[21:49:06] <icinga-wm>	 73309609, number_of_in_flight_fetch: 0, number_of_pending_tasks: 0, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:49:14] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_pending_tasks: 0, status: yellow, number_of_nodes: 5, number_of_data_nodes: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, active_primary_shards: 484, unassigned_shards: 369, active_shards: 747, cluster_name: production-logstash-eqiad, delayed_u
[21:49:14] <icinga-wm>	  0, timed_out: False, initializing_shards: 8, active_shards_percent_as_number: 66.45907473309609 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:49:16] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[21:49:26] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[21:49:33] <wikibugs>	 Operations, ops-eqiad, cloud-services-team (Hardware): (Need by: 2020-03-02) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (Andrew)
[21:50:03] <wikibugs>	 Operations, ops-eqiad, cloud-services-team (Hardware): (Need by: 2020-03-02) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (Andrew) Open→Resolved I have an OS installed on all three of these hosts and I'm experimenting on them in the cloud-v...
[21:50:06] <wikibugs>	 Operations, DC-Ops, hardware-requests: eqiad: three clouvirt-wdqs servers for WDQS testing - https://phabricator.wikimedia.org/T232654 (Andrew)
[21:50:24] <shdubsh>	 !log start elasticsearch on logstash1010
[21:50:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:49] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@c70b338]: AQS: Minor fix take 2 (duration: 02m 59s)
[21:51:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:14] <icinga-wm>	 PROBLEM - Check systemd state on logstash1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:52:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:52:28] <icinga-wm>	 RECOVERY - logstash process on logstash1009 is OK: PROCS OK: 1 process with UID = 498 (logstash), command name java, args logstash https://wikitech.wikimedia.org/wiki/Logstash
[21:52:31] <wikibugs>	 (PS1) Herron: Revert "hieradata: send mw1262's apache logs to logstash" [puppet] - https://gerrit.wikimedia.org/r/575358
[21:52:43] <effie>	 herron: better depool it 
[21:52:54] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:52:54] <effie>	 reverting it will not delete the file 
[21:52:57] <effie>	 I'll do it 
[21:53:14] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[21:53:31] <herron>	 effie: ok thanks
[21:53:36] <effie>	 !log depool mw1262, suspecting it might have overloaded logstash 
[21:53:38] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1010 is OK: OK - elasticsearch status production-logstash-eqiad: delayed_unassigned_shards: 0, active_shards: 748, number_of_data_nodes: 3, status: yellow, cluster_name: production-logstash-eqiad, number_of_in_flight_fetch: 0, initializing_shards: 5, unassigned_shards: 371, active_primary_shards: 484, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, num
[21:53:38] <icinga-wm>	 timed_out: False, active_shards_percent_as_number: 66.54804270462633, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:53:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:40] <wikibugs>	 (Abandoned) Herron: Revert "hieradata: send mw1262's apache logs to logstash" [puppet] - https://gerrit.wikimedia.org/r/575358 (owner: Herron)
[21:53:50] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[21:54:34] <icinga-wm>	 RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:59:23] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@5a67e6e]: AQS: Minor fix take 3
[21:59:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:40] <wikibugs>	 (PS1) Jforrester: tests: Assert the 'wordmark' config set-up [mediawiki-config] - https://gerrit.wikimedia.org/r/575361
[22:01:43] <wikibugs>	 (PS1) Jforrester: Only try to set wgLogos['wordmark'] if not already done [mediawiki-config] - https://gerrit.wikimedia.org/r/575362
[22:01:45] <wikibugs>	 (PS1) Jforrester: Re-try "Merge wgMinervaCustomLogos into wgLogos" [mediawiki-config] - https://gerrit.wikimedia.org/r/575363
[22:01:47] <wikibugs>	 (PS1) Jforrester: Stop setting wgLogos['wordmark'] based on wgMinervaCustomLogos, never set [mediawiki-config] - https://gerrit.wikimedia.org/r/575364
[22:01:49] <wikibugs>	 (PS1) Jforrester: Stop loading 'wikipedia-english', 'wikipedia-e-acute', 'wikipedia-cyrillic', 'wikipedia-devanagari' [mediawiki-config] - https://gerrit.wikimedia.org/r/575365
[22:01:51] <wikibugs>	 (PS1) Jforrester: Stop defining 'wikipedia-english', 'wikipedia-e-acute', 'wikipedia-cyrillic', 'wikipedia-devanagari' [mediawiki-config] - https://gerrit.wikimedia.org/r/575366
[22:03:26] <icinga-wm>	 RECOVERY - logstash process on logstash2006 is OK: PROCS OK: 1 process with UID = 498 (logstash), command name java, args logstash https://wikitech.wikimedia.org/wiki/Logstash
[22:03:48] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[22:05:38] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[22:06:47] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@5a67e6e]: AQS: Minor fix take 3 (duration: 07m 24s)
[22:06:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:29] <wikibugs>	 (PS2) Jforrester: Re-try "Merge wgMinervaCustomLogos into wgLogos" [mediawiki-config] - https://gerrit.wikimedia.org/r/575363
[22:07:31] <wikibugs>	 (PS2) Jforrester: Stop setting wgLogos['wordmark'] based on wgMinervaCustomLogos, never set [mediawiki-config] - https://gerrit.wikimedia.org/r/575364
[22:07:33] <wikibugs>	 (PS2) Jforrester: Stop loading 'wikipedia-english', 'wikipedia-e-acute', 'wikipedia-cyrillic', 'wikipedia-devanagari' [mediawiki-config] - https://gerrit.wikimedia.org/r/575365
[22:07:35] <wikibugs>	 (PS2) Jforrester: Stop defining 'wikipedia-english', 'wikipedia-e-acute', 'wikipedia-cyrillic', 'wikipedia-devanagari' [mediawiki-config] - https://gerrit.wikimedia.org/r/575366
[22:08:03] <wikibugs>	 (CR) Jforrester: [C: +2] tests: Assert the 'wordmark' config set-up [mediawiki-config] - https://gerrit.wikimedia.org/r/575361 (owner: Jforrester)
[22:08:48] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[22:09:08] <icinga-wm>	 RECOVERY - logstash process on logstash2005 is OK: PROCS OK: 1 process with UID = 498 (logstash), command name java, args logstash https://wikitech.wikimedia.org/wiki/Logstash
[22:09:56] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[22:10:12] <wikibugs>	 (Merged) jenkins-bot: tests: Assert the 'wordmark' config set-up [mediawiki-config] - https://gerrit.wikimedia.org/r/575361 (owner: Jforrester)
[22:21:13] <wikibugs>	 Operations, ops-eqiad, DC-Ops: audit/rebalance power in a5-eqiad - https://phabricator.wikimedia.org/T245655 (ayounsi) I disabled alerting for that host as it has been alerting/flapping regularly.  To be turned back on when fixed: https://librenms.wikimedia.org/device/device=41/tab=edit/
[22:39:04] <wikibugs>	 (CR) C. Scott Ananian: [C: +1] Parsoid: Use the version of Parsoid in $IP/vendor (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) (owner: Jforrester)
[22:41:13] <wikibugs>	 (CR) Subramanya Sastry: Parsoid: Use the version of Parsoid in $IP/vendor (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) (owner: Jforrester)
[22:49:33] <James_F>	 !log Manually `scap pull`ed on mw1349 and mw1351 as they were emitting odd errors.
[22:49:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:17] <wikibugs>	 (CR) Jforrester: [C: +2] Only try to set wgLogos['wordmark'] if not already done [mediawiki-config] - https://gerrit.wikimedia.org/r/575362 (owner: Jforrester)
[22:59:35] <wikibugs>	 (Merged) jenkins-bot: Only try to set wgLogos['wordmark'] if not already done [mediawiki-config] - https://gerrit.wikimedia.org/r/575362 (owner: Jforrester)
[23:01:05] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Only try to set wgLogos['wordmark'] if not already done (duration: 00m 58s)
[23:01:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:26] <wikibugs>	 (CR) Jforrester: [C: +2] Re-try "Merge wgMinervaCustomLogos into wgLogos" [mediawiki-config] - https://gerrit.wikimedia.org/r/575363 (owner: Jforrester)
[23:02:27] <wikibugs>	 (Merged) jenkins-bot: Re-try "Merge wgMinervaCustomLogos into wgLogos" [mediawiki-config] - https://gerrit.wikimedia.org/r/575363 (owner: Jforrester)
[23:04:55] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Merge wgMinervaCustomLogos into wgLogos, take 2 (duration: 00m 56s)
[23:04:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:45] <wikibugs>	 (CR) Jforrester: [C: +2] Stop setting wgLogos['wordmark'] based on wgMinervaCustomLogos, never set [mediawiki-config] - https://gerrit.wikimedia.org/r/575364 (owner: Jforrester)
[23:06:42] <wikibugs>	 (Merged) jenkins-bot: Stop setting wgLogos['wordmark'] based on wgMinervaCustomLogos, never set [mediawiki-config] - https://gerrit.wikimedia.org/r/575364 (owner: Jforrester)
[23:07:30] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bonus sync for cache clearance (duration: 00m 56s)
[23:07:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:19] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Stop setting wgLogos['wordmark'] based on wgMinervaCustomLogos, never set (duration: 00m 56s)
[23:10:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:01] <wikibugs>	 (PS1) Jdlrobson: Drop legacy main page special casing on select projects [mediawiki-config] - https://gerrit.wikimedia.org/r/575376 (https://phabricator.wikimedia.org/T32405)
[23:28:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] Drop legacy main page special casing on select projects [mediawiki-config] - https://gerrit.wikimedia.org/r/575376 (https://phabricator.wikimedia.org/T32405) (owner: Jdlrobson)
[23:45:35] <wikibugs>	 (PS3) Jforrester: Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055)
[23:46:43] <wikibugs>	 (CR) jerkins-bot: [V: -1] Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) (owner: Jforrester)
[23:48:27] <wikibugs>	 (PS4) Jforrester: Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055)
[23:53:02] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[23:53:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:43] <wikibugs>	 (Abandoned) saper: Wikistats v2: go live [puppet] - https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) (owner: saper)