[00:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T0000).
[00:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[00:06:20] <wikibugs>	 (CR) CRusnov: "> Patch Set 4:" [software/netbox-reports] - https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: CRusnov)
[00:10:28] <urandom>	 !log bootstrapping cassandra-c, restbase2018 -- T210843
[00:10:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:31] <stashbot>	 T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843
[00:11:11] <icinga-wm>	 RECOVERY - Check systemd state on restbase2018 is OK: OK - running: The system is fully operational
[00:11:22] <wikibugs>	 (PS1) Tim Starling: Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - https://gerrit.wikimedia.org/r/477939
[00:11:35] <icinga-wm>	 RECOVERY - cassandra-c service on restbase2018 is OK: OK - cassandra-c is active
[00:11:57] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.48.126:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-c valid until 2020-11-29 09:26:22 +0000 (expires in 724 days)
[00:12:27] <wikibugs>	 (CR) jerkins-bot: [V: -1] Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[00:12:31] <wikibugs>	 Operations, ops-codfw, Core Platform Team, Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (Eevans)
[00:13:18] <mutante>	 !log re-enabling puppet on phabricator, applying change that adds php-fpm support on stretch ..which doesn't affect phab1001 (prod) on jessie.. BUT re-adds tuning config from the past for mpm_prefork.conf (more SpareServers etc) that was not actually applied due to a bug
[00:13:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:21] <wikibugs>	 (PS1) Bstorm: sonofgridengine: grant more control over shadowd using the env settings [puppet] - https://gerrit.wikimedia.org/r/477940 (https://phabricator.wikimedia.org/T211258)
[00:15:32] <mutante>	 !log MPM prefork tweaks for high load systems are applied again (apparently they were not since a change in the past that resulted in 2 competing configs in mods-enabled and conf-enabled, with the latter one being loaded last and containing the package defaults)
[00:15:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:42] <wikibugs>	 (CR) Bstorm: [C: 2] sonofgridengine: grant more control over shadowd using the env settings [puppet] - https://gerrit.wikimedia.org/r/477940 (https://phabricator.wikimedia.org/T211258) (owner: Bstorm)
[00:18:08] <wikibugs>	 (CR) Tim Starling: "Is there a reason we are running a PHP 5.5 lint against this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[00:20:11] <wikibugs>	 (CR) Tim Starling: [C: 2] "I'm thinking about ways to abstract env/services in the context of https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/477939" [mediawiki-config] - https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) (owner: Krinkle)
[00:20:25] <wikibugs>	 (PS5) Paladox: profile::phabricator::httpd: Fix worker configs and also use hiera value [puppet] - https://gerrit.wikimedia.org/r/477925
[00:21:22] <wikibugs>	 (Merged) jenkins-bot: errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) (owner: Krinkle)
[00:22:31] <wikibugs>	 Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (Dzahn) While merging and confirming https://gerrit.wiki...
[00:25:47] <wikibugs>	 (PS6) Paladox: profile::phabricator::httpd: Fix worker configs and also use hiera value [puppet] - https://gerrit.wikimedia.org/r/477925
[00:25:53] <wikibugs>	 (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/477925 (owner: Paladox)
[00:27:47] <wikibugs>	 (CR) jenkins-bot: errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) (owner: Krinkle)
[00:31:24] <wikibugs>	 (CR) Tim Starling: [C: 2] "I want to use this right now to test https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/467239/ which I just merged." [mediawiki-config] - https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) (owner: BPirkle)
[00:32:30] <wikibugs>	 (Merged) jenkins-bot: Create script to intentionally trigger fatal errors in MediaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) (owner: BPirkle)
[00:36:23] <wikibugs>	 (PS6) CRusnov: Add an old hardware report [software/netbox-reports] - https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899)
[00:37:30] <wikibugs>	 (CR) Volans: [C: -1] "Code and compiler both looks good. Just a couple of minor details inline." (5 comments) [puppet] - https://gerrit.wikimedia.org/r/477845 (https://phabricator.wikimedia.org/T205899) (owner: CRusnov)
[00:38:14] <logmsgbot>	 !log tstarling@deploy1001 Synchronized private/FatalErrorSettings.php: (no justification provided) (duration: 00m 46s)
[00:38:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:19] <wikibugs>	 (PS7) CRusnov: Add an old hardware report [software/netbox-reports] - https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899)
[00:40:52] <wikibugs>	 (CR) jenkins-bot: Create script to intentionally trigger fatal errors in MediaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) (owner: BPirkle)
[00:40:56] <logmsgbot>	 !log tstarling@deploy1001 Synchronized private/FatalErrorSettings.php: (no justification provided) (duration: 00m 46s)
[00:40:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:15] <logmsgbot>	 !log tstarling@deploy1001 Synchronized w/fatal-error.php: (no justification provided) (duration: 00m 46s)
[00:42:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:01] <wikibugs>	 (CR) Volans: [C: 1] "LGTM. I agree with Faidon, better run this at least once and then decide based on the results what we want to do." (1 comment) [software/netbox-reports] - https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: CRusnov)
[00:50:12] <wikibugs>	 (PS1) Tim Starling: In fatal-error.php remove incorrect $wiki parameter [mediawiki-config] - https://gerrit.wikimedia.org/r/477944
[00:52:16] <wikibugs>	 (CR) Tim Starling: [C: 2] In fatal-error.php remove incorrect $wiki parameter [mediawiki-config] - https://gerrit.wikimedia.org/r/477944 (owner: Tim Starling)
[00:52:34] <wikibugs>	 Operations, ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (RobH) Please note the captive nuts for this arrived to my place today, and the brackets are on site.  I can now start on this process of swapping the PDUS over.  I'll go in...
[00:53:20] <wikibugs>	 (Merged) jenkins-bot: In fatal-error.php remove incorrect $wiki parameter [mediawiki-config] - https://gerrit.wikimedia.org/r/477944 (owner: Tim Starling)
[00:55:07] <logmsgbot>	 !log tstarling@deploy1001 Synchronized w/fatal-error.php: (no justification provided) (duration: 00m 47s)
[00:55:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:47] <wikibugs>	 Operations, ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (RobH)
[00:59:35] <wikibugs>	 (PS1) Tim Starling: Constants must be scalar in HHVM [mediawiki-config] - https://gerrit.wikimedia.org/r/477945
[01:00:04] <jouncebot>	 twentyafterfour: #bothumor I � Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T0100).
[01:01:09] <wikibugs>	 (CR) jenkins-bot: In fatal-error.php remove incorrect $wiki parameter [mediawiki-config] - https://gerrit.wikimedia.org/r/477944 (owner: Tim Starling)
[01:01:24] <wikibugs>	 (CR) Tim Starling: [C: 2] Constants must be scalar in HHVM [mediawiki-config] - https://gerrit.wikimedia.org/r/477945 (owner: Tim Starling)
[01:02:27] <wikibugs>	 (Merged) jenkins-bot: Constants must be scalar in HHVM [mediawiki-config] - https://gerrit.wikimedia.org/r/477945 (owner: Tim Starling)
[01:03:29] <logmsgbot>	 !log tstarling@deploy1001 Synchronized w/fatal-error.php: (no justification provided) (duration: 00m 46s)
[01:03:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:19] <wikibugs>	 (CR) Tim Starling: [C: 2] "I tested this in production by inducing 10 fatal errors with /w/fatal-error.php and verifying that the count went up accordingly in Grafan" [mediawiki-config] - https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) (owner: Krinkle)
[01:21:48] <wikibugs>	 (CR) jenkins-bot: Constants must be scalar in HHVM [mediawiki-config] - https://gerrit.wikimedia.org/r/477945 (owner: Tim Starling)
[02:02:25] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.48.126:9042 on restbase2018 is OK: TCP OK - 0.036 second response time on 10.192.48.126 port 9042
[02:17:11] <icinga-wm>	 PROBLEM - Apache HTTP on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:17:21] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:17:23] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:17:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:29] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:17:31] <icinga-wm>	 PROBLEM - HTTP-dbtree on dbmonitor2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:17:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:37] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:17:37] <icinga-wm>	 PROBLEM - DPKG on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:17:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:45] <icinga-wm>	 PROBLEM - Check systemd state on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:17:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:49] <icinga-wm>	 PROBLEM - dhclient process on proton2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.75: Connection reset by peer
[02:17:51] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:17:51] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:17:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:57] <icinga-wm>	 PROBLEM - Check systemd state on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:17:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:01] <icinga-wm>	 PROBLEM - HHVM rendering on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:18:05] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:18:07] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:11] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:13] <icinga-wm>	 PROBLEM - PHP7 rendering on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:18:15] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:23] <icinga-wm>	 PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:18:25] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:31] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:18:31] <icinga-wm>	 PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.139: Connection reset by peer
[02:18:33] <icinga-wm>	 PROBLEM - puppet last run on mw2264 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:18:55] <icinga-wm>	 PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.139: Connection reset by peer
[02:18:55] <icinga-wm>	 PROBLEM - Check size of conntrack table on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:19:05] <icinga-wm>	 PROBLEM - ganeti-mond running on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:19:25] <icinga-wm>	 PROBLEM - dhclient process on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:19:27] <icinga-wm>	 PROBLEM - Check systemd state on alsafi is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.50: Connection reset by peer
[02:19:27] <icinga-wm>	 PROBLEM - Check size of conntrack table on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:19:39] <icinga-wm>	 PROBLEM - mcrouter process on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:19:42] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:19:42] <icinga-wm>	 PROBLEM - Disk space on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:19:45] <icinga-wm>	 PROBLEM - dhclient process on alsafi is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.50: Connection reset by peer
[02:19:53] <icinga-wm>	 PROBLEM - puppet last run on conf2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:19:59] <icinga-wm>	 PROBLEM - configured eth on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:20:01] <icinga-wm>	 PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:20:09] <icinga-wm>	 PROBLEM - HHVM processes on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:20:13] <icinga-wm>	 PROBLEM - Check systemd state on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:20:17] <icinga-wm>	 PROBLEM - php7.2-fpm service on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:20:21] <icinga-wm>	 PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.50: Connection reset by peer
[02:20:23] <icinga-wm>	 PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.139: Connection reset by peer
[02:20:25] <icinga-wm>	 PROBLEM - puppet last run on es2018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:21:05] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:21:07] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.139: Connection reset by peer
[02:21:23] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:21:29] <icinga-wm>	 PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.139: Connection reset by peer
[02:21:31] <icinga-wm>	 PROBLEM - puppet last run on mw2261 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:21:43] <icinga-wm>	 PROBLEM - DPKG on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:21:47] <icinga-wm>	 PROBLEM - puppet last run on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:21:59] <icinga-wm>	 PROBLEM - Check systemd state on alsafi is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.50: Connection reset by peer
[02:22:09] <icinga-wm>	 PROBLEM - mcrouter process on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:22:13] <icinga-wm>	 PROBLEM - Disk space on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:22:25] <icinga-wm>	 PROBLEM - Check size of conntrack table on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:22:25] <icinga-wm>	 PROBLEM - dhclient process on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:22:31] <icinga-wm>	 PROBLEM - configured eth on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:22:33] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:22:35] <icinga-wm>	 PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.50: Connection reset by peer
[02:22:35] <icinga-wm>	 PROBLEM - DPKG on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:22:41] <icinga-wm>	 PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.139: Connection reset by peer
[02:22:51] <icinga-wm>	 PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.50: Connection reset by peer
[02:22:53] <icinga-wm>	 PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:23:09] <icinga-wm>	 RECOVERY - Check systemd state on alsafi is OK: OK - running: The system is fully operational
[02:23:09] <icinga-wm>	 RECOVERY - dhclient process on ganeti2003 is OK: PROCS OK: 0 processes with command name dhclient
[02:23:11] <icinga-wm>	 RECOVERY - Check size of conntrack table on mwdebug2002 is OK: OK: nf_conntrack is 0 % full
[02:23:17] <icinga-wm>	 RECOVERY - Apache HTTP on mwdebug2002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.116 second response time
[02:23:17] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mwdebug2002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.204 second response time
[02:23:17] <icinga-wm>	 PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:23:17] <icinga-wm>	 RECOVERY - mcrouter process on mwdebug2002 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter
[02:23:19] <icinga-wm>	 RECOVERY - Disk space on dbmonitor2001 is OK: DISK OK
[02:23:19] <icinga-wm>	 RECOVERY - Disk space on mwdebug2002 is OK: DISK OK
[02:23:23] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mwdebug2002 is OK: OK ferm input default policy is set
[02:23:23] <icinga-wm>	 RECOVERY - Disk space on pybal-test2001 is OK: DISK OK
[02:23:23] <icinga-wm>	 RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient
[02:23:29] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational
[02:23:29] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set
[02:23:29] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[02:23:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy
[02:23:31] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[02:23:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy
[02:23:35] <icinga-wm>	 RECOVERY - dhclient process on dbmonitor2001 is OK: PROCS OK: 0 processes with command name dhclient
[02:23:35] <icinga-wm>	 RECOVERY - Check size of conntrack table on ganeti2003 is OK: OK: nf_conntrack is 0 % full
[02:23:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[02:23:37] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[02:23:37] <icinga-wm>	 RECOVERY - HTTP-dbtree on dbmonitor2001 is OK: HTTP OK: HTTP/1.1 200 OK - 78891 bytes in 1.076 second response time
[02:23:39] <icinga-wm>	 RECOVERY - configured eth on mwdebug2002 is OK: OK - interfaces up
[02:23:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy
[02:23:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy
[02:23:41] <icinga-wm>	 RECOVERY - ganeti-noded running on ganeti2003 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded
[02:23:43] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on dbmonitor2001 is OK: OK ferm input default policy is set
[02:23:43] <icinga-wm>	 RECOVERY - DPKG on alsafi is OK: All packages OK
[02:23:43] <icinga-wm>	 RECOVERY - DPKG on mwdebug2002 is OK: All packages OK
[02:23:44] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[02:23:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy
[02:23:47] <icinga-wm>	 RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up
[02:23:49] <icinga-wm>	 RECOVERY - Check size of conntrack table on dbmonitor2001 is OK: OK: nf_conntrack is 0 % full
[02:23:49] <icinga-wm>	 RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full
[02:23:49] <icinga-wm>	 RECOVERY - HHVM processes on mwdebug2002 is OK: PROCS OK: 6 processes with command name hhvm
[02:23:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy
[02:23:53] <icinga-wm>	 RECOVERY - Check systemd state on ganeti2003 is OK: OK - running: The system is fully operational
[02:23:57] <icinga-wm>	 RECOVERY - dhclient process on proton2002 is OK: PROCS OK: 0 processes with command name dhclient
[02:23:57] <icinga-wm>	 RECOVERY - php7.2-fpm service on mwdebug2002 is OK: OK - php7.2-fpm is active
[02:23:59] <icinga-wm>	 RECOVERY - ganeti-mond running on ganeti2003 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond
[02:23:59] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[02:23:59] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[02:23:59] <icinga-wm>	 RECOVERY - configured eth on alsafi is OK: OK - interfaces up
[02:24:01] <icinga-wm>	 RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[02:24:03] <icinga-wm>	 RECOVERY - Check systemd state on dbmonitor2001 is OK: OK - running: The system is fully operational
[02:24:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy
[02:24:05] <icinga-wm>	 RECOVERY - HHVM rendering on mwdebug2002 is OK: HTTP OK: HTTP/1.1 200 OK - 76170 bytes in 0.304 second response time
[02:24:05] <icinga-wm>	 RECOVERY - DPKG on dbmonitor2001 is OK: All packages OK
[02:24:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy
[02:24:11] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[02:24:13] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy
[02:24:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy
[02:24:17] <icinga-wm>	 RECOVERY - PHP7 rendering on mwdebug2002 is OK: HTTP OK: HTTP/1.1 200 OK - 76210 bytes in 0.266 second response time
[02:24:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[02:24:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[02:24:19] <icinga-wm>	 RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy
[02:24:23] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[02:24:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[02:24:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy
[02:24:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy
[02:24:35] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy
[02:25:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy
[02:25:09] <icinga-wm>	 RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures
[02:25:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[02:25:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy
[02:26:53] <icinga-wm>	 RECOVERY - puppet last run on ganeti2003 is OK: OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures
[02:37:54] <greg-g>	 eh?
[02:39:04] <Pchelolo>	 the citoid/restbase ones were semi-expected, it's been flapping since switch to kubernetes, Marko&Alex are working on that
[02:39:19] <icinga-wm>	 RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational
[02:40:10] <Pchelolo>	 the mwdebug ones I do not know about
[02:41:44] <greg-g>	 thanks Pchelolo 
[02:46:03] <icinga-wm>	 RECOVERY - puppet last run on conf2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:46:33] <icinga-wm>	 RECOVERY - puppet last run on es2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:47:37] <icinga-wm>	 RECOVERY - puppet last run on mw2261 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:48:59] <icinga-wm>	 RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[02:49:43] <icinga-wm>	 RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:49:55] <icinga-wm>	 RECOVERY - puppet last run on mw2264 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:56:42] <wikibugs>	 10Operations, 10ops-codfw, 10Core Platform Team, 10Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (10Eevans)
[02:56:53] <wikibugs>	 10Operations, 10ops-codfw, 10Core Platform Team, 10Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (10Eevans)
[03:09:16] <wikibugs>	 10Operations, 10Community-Tech, 10MediaWiki-extensions-PageAssessments: cron spam from mwmaint1002 - https://phabricator.wikimedia.org/T211269 (10Mathew.onipe)
[03:10:33] <wikibugs>	 10Operations, 10Community-Tech, 10DBA, 10MediaWiki-Database, 10MediaWiki-extensions-PageAssessments: cron spam from mwmaint1002 - https://phabricator.wikimedia.org/T211269 (10Mathew.onipe)
[03:15:18] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, 10Multimedia: Cron <www-data@mwmaint1002> /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T211270 (10Mathew.onipe)
[03:21:25] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Cloud-Services: Cron <root@labweb1001> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211271 (10Mathew.onipe)
[03:23:04] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Cloud-Services: Cron <root@labweb1001> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211271 (10Mathew.onipe) p:05Triage>03Normal
[03:25:40] <wikibugs>	 10Operations, 10MediaWiki-API: Cron <root@mw2182> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211272 (10Mathew.onipe)
[03:26:10] <wikibugs>	 10Operations, 10MediaWiki-API: Cron <root@mw2182> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211272 (10Mathew.onipe) p:05Triage>03Normal
[03:28:18] <wikibugs>	 10Operations, 10MediaWiki-Debug-Logger: Cron <root@mwdebug2002> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211273 (10Mathew.onipe)
[03:28:24] <wikibugs>	 10Operations, 10MediaWiki-Debug-Logger: Cron <root@mwdebug2002> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211273 (10Mathew.onipe) p:05Triage>03Normal
[03:37:17] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 886.59 seconds
[03:44:54] <wikibugs>	 (03PS2) 10Tim Starling: Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477939
[03:44:56] <wikibugs>	 (03PS1) 10Tim Starling: Class wrapper for ProductionServices.php etc. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477956
[03:44:58] <wikibugs>	 (03PS1) 10Tim Starling: Put profiler hostnames in ProductionServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477957
[03:45:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477939 (owner: 10Tim Starling)
[03:46:40] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe)
[03:47:00] <wikibugs>	 (03PS2) 10Mathew.onipe: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023)
[03:54:06] <wikibugs>	 (03PS1) 10Mathew.onipe: setup: change curator version to 4.2.5 to match our current elasticsearch version [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958
[04:56:05] <wikibugs>	 (03PS1) 10Stella: Add several HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477960 (https://phabricator.wikimedia.org/T150618)
[04:56:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add several HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477960 (https://phabricator.wikimedia.org/T150618) (owner: 10Stella)
[04:58:44] <wikibugs>	 (03PS2) 10Stella: Add several HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477960 (https://phabricator.wikimedia.org/T150618)
[05:19:45] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 187.30 seconds
[05:27:21] <wikibugs>	 10Operations, 10MediaWiki-API: Cron <root@mw2182> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211272 (10Legoktm) @Mathew.onipe any reason you added #mediawiki-api ? I don't see the relevance.
[05:31:54] <wikibugs>	 10Operations: Cron <root@mw2182> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211272 (10Mathew.onipe)
[05:32:26] <wikibugs>	 10Operations: Cron <root@mw2182> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211272 (10Mathew.onipe) My bad. Removed.  Thanks
[05:47:58] <wikibugs>	 (03PS5) 10Fomafix: Add language codes sr-cyrl and sr-latn next to sr-ec and sr-el [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375616 (https://phabricator.wikimedia.org/T117845)
[06:12:48] <wikibugs>	 (03CR) 10Legoktm: "> LGTM, not merging right now as I don't know the status of the related code in prod, but feel free to ping if you need it merged." [puppet] - 10https://gerrit.wikimedia.org/r/475579 (https://phabricator.wikimedia.org/T210312) (owner: 10Legoktm)
[06:29:33] <icinga-wm>	 PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:33:41] <wikibugs>	 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10User-ArielGlenn: Correctly collect logs from php-fpm pools - https://phabricator.wikimedia.org/T211184 (10Joe)
[06:39:21] <icinga-wm>	 RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[06:40:42] <wikibugs>	 10Operations, 10Traffic, 10HTTPS, 10Upstream: Enable ESNI support on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Liuxinyu970226) @Krenair @Bawolff @jcrespo Wondering if we can enable QUIC support on our server clusters instead? I've heard that the [[https://github.com/googlehosts|Googl...
[06:58:08] <wikibugs>	 (03CR) 10Elukey: "Going to delete these crons since they are not used anymore (see the ensure => absent) and they are confusing, feel free to drop this code" [puppet] - 10https://gerrit.wikimedia.org/r/477818 (owner: 10Fdans)
[07:00:50] <wikibugs>	 (03PS1) 10Elukey: profile::analytics::refinery::job::sqoop_mediawiki: remove crons [puppet] - 10https://gerrit.wikimedia.org/r/477962
[07:12:52] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211278 (10ops-monitoring-bot)
[07:25:15] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211279 (10ops-monitoring-bot)
[07:29:41] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::sqoop_mediawiki: remove crons [puppet] - 10https://gerrit.wikimedia.org/r/477962 (owner: 10Elukey)
[07:34:12] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211281 (10ops-monitoring-bot)
[07:35:08] <icinga-wm>	 ACKNOWLEDGEMENT - Device not healthy -SMART- on stat1004 is CRITICAL: cluster=analytics device=sde instance=stat1004:9100 job=node site=eqiad Muehlenhoff /dev/sde is a USB drive which is temporarily attached and which smartctl cant parse with its default settings. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004var-datasource=eqiad%2520prometheus%252Fops
[07:38:33] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211282 (10ops-monitoring-bot)
[07:40:04] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[07:41:43] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211283 (10ops-monitoring-bot)
[07:42:33] <icinga-wm>	 RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy
[07:45:34] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211284 (10ops-monitoring-bot)
[07:48:10] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211285 (10ops-monitoring-bot)
[07:53:19] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211286 (10ops-monitoring-bot)
[07:55:52] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211287 (10ops-monitoring-bot)
[07:58:21] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove Diamond on maps servers [puppet] - 10https://gerrit.wikimedia.org/r/477748 (https://phabricator.wikimedia.org/T183454)
[08:03:28] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211288 (10ops-monitoring-bot)
[08:03:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:04:55] <icinga-wm>	 RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.033 second response time
[08:09:17] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211289 (10ops-monitoring-bot)
[08:12:19] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211290 (10ops-monitoring-bot)
[08:17:22] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211291 (10ops-monitoring-bot)
[08:18:50] <icinga-wm>	 PROBLEM - puppet last run on an-worker1083 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[spark2_yarn_shuffle_jar_install],Package[hadoop-client],Package[libhdfs0]
[08:19:23] <wikibugs>	 (03PS1) 10Elukey: profile::hadoop::spark2: explicitly require hadoop-yarn-nodemanager [puppet] - 10https://gerrit.wikimedia.org/r/477965
[08:20:33] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211292 (10ops-monitoring-bot)
[08:21:36] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::hadoop::spark2: explicitly require hadoop-yarn-nodemanager [puppet] - 10https://gerrit.wikimedia.org/r/477965 (owner: 10Elukey)
[08:24:50] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211293 (10ops-monitoring-bot)
[08:25:10] <icinga-wm>	 PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:26:01] <elukey>	 this is my fault --^
[08:28:12] <icinga-wm>	 PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:28:56] <icinga-wm>	 RECOVERY - puppet last run on an-worker1083 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:30:16] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove Diamond on maps servers [puppet] - 10https://gerrit.wikimedia.org/r/477748 (https://phabricator.wikimedia.org/T183454)
[08:30:50] <wikibugs>	 10Operations, 10Traffic, 10HTTPS, 10Upstream: Enable ESNI support on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Bawolff) >>! In T205378#4802463, @Liuxinyu970226 wrote: > @Krenair @Bawolff @jcrespo Wondering if we can enable QUIC support on our server clusters instead? I've heard that...
[08:31:35] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211295 (10ops-monitoring-bot)
[08:31:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Remove Diamond on maps servers [puppet] - 10https://gerrit.wikimedia.org/r/477748 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[08:36:32] <wikibugs>	 (03PS1) 10Elukey: profile::hadoop::spark2: fix dependency on exec [puppet] - 10https://gerrit.wikimedia.org/r/477968
[08:37:40] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211296 (10ops-monitoring-bot)
[08:39:03] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::hadoop::spark2: fix dependency on exec [puppet] - 10https://gerrit.wikimedia.org/r/477968 (owner: 10Elukey)
[08:43:44] <icinga-wm>	 PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:45:50] <icinga-wm>	 RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:48:56] <icinga-wm>	 RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:49:54] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211298 (10ops-monitoring-bot)
[08:54:34] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] "Due to SWAT policies, InitialiseSettings.php must be updated in another patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477960 (https://phabricator.wikimedia.org/T150618) (owner: 10Stella)
[08:57:39] <wikibugs>	 10Operations, 10Icinga, 10Scoring-platform-team, 10Patch-For-Review: Add ahalfaker to ORES-related icinga contacts - https://phabricator.wikimedia.org/T210742 (10Halfak) Thank you!
[08:59:41] <icinga-wm>	 RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:00:08] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211299 (10ops-monitoring-bot)
[09:03:20] <wikibugs>	 (03CR) 10Muehlenhoff: "I think there's been some general confusion where the metrics are supposed to end up: So far labmon has been mostly used to collect metric" [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) (owner: 10Cwhite)
[09:06:28] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211300 (10ops-monitoring-bot)
[09:10:04] <wikibugs>	 (03PS1) 10Elukey: profile::hadoop::common: remove explicit dep ordering [puppet] - 10https://gerrit.wikimedia.org/r/477970
[09:10:15] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211301 (10ops-monitoring-bot)
[09:12:08] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::hadoop::common: remove explicit dep ordering [puppet] - 10https://gerrit.wikimedia.org/r/477970 (owner: 10Elukey)
[09:12:57] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211302 (10ops-monitoring-bot)
[09:28:35] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211303 (10ops-monitoring-bot)
[09:29:41] <wikibugs>	 (03CR) 10Gehel: "LGTM, minor comment inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 (owner: 10Mathew.onipe)
[09:32:50] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "LGTM, let's wait until elasticsearch is shutdown on all those servers to merge it" [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe)
[09:33:15] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211304 (10ops-monitoring-bot)
[09:38:07] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211307 (10ops-monitoring-bot)
[09:39:38] <wikibugs>	 (03PS1) 10Ema: ATS: define global Lua scripts in plugin.config [puppet] - 10https://gerrit.wikimedia.org/r/477974 (https://phabricator.wikimedia.org/T207048)
[09:42:07] <wikibugs>	 (03PS1) 10Elukey: profile::hadoop::common: add ordering for apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/477975
[09:44:15] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Gehel)
[09:44:28] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::hadoop::common: add ordering for apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/477975 (owner: 10Elukey)
[09:45:52] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Gehel) All servers configured.  @Papaul I'm not sure if you need to track anything else on this task, but from my side, it can be closed.
[09:48:54] <wikibugs>	 (03PS2) 10Ema: ATS: define global Lua scripts in plugin.config [puppet] - 10https://gerrit.wikimedia.org/r/477974 (https://phabricator.wikimedia.org/T207048)
[09:52:44] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754
[09:53:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754 (owner: 10Alexandros Kosiaris)
[09:53:50] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Rake: Support ignoring upstream modules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477754 (owner: 10Alexandros Kosiaris)
[09:56:02] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754
[09:56:28] <dcausse>	 !log elastic@codfw cleanup: deleting wikidatawiki_content_1537469318 index (failed reindex probably)
[09:56:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754 (owner: 10Alexandros Kosiaris)
[09:57:45] <wikibugs>	 (03PS3) 10Ema: ATS: define global Lua scripts in plugin.config [puppet] - 10https://gerrit.wikimedia.org/r/477974 (https://phabricator.wikimedia.org/T207048)
[09:59:06] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211308 (10ops-monitoring-bot)
[09:59:44] <wikibugs>	 (03CR) 10Muehlenhoff: prometheus: add directory size collector (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) (owner: 10Cwhite)
[10:01:34] <wikibugs>	 (03CR) 10Ema: [C: 032] ATS: define global Lua scripts in plugin.config [puppet] - 10https://gerrit.wikimedia.org/r/477974 (https://phabricator.wikimedia.org/T207048) (owner: 10Ema)
[10:02:46] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754
[10:02:54] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211309 (10ops-monitoring-bot)
[10:03:05] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Mathew.onipe)
[10:06:33] <wikibugs>	 10Operations, 10Security-Team, 10Wikimedia-Site-requests: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10Aklapper) I've seen a few people not understanding `[Report Only] Refused to connect to blah`, thinking it is an error. I can only point to http://bots.wmflabs.org/~...
[10:10:14] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Mathew.onipe)
[10:12:09] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Mathew.onipe)
[10:14:01] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211310 (10ops-monitoring-bot)
[10:14:58] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Mathew.onipe)
[10:17:04] <icinga-wm>	 PROBLEM - puppet last run on an-worker1087 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 13 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hadoop-client],Package[libhdfs0]
[10:19:18] <wikibugs>	 (03PS1) 10Hoo man: Wikidata: Display Kartographer mapframes for geocoordinate statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477976 (https://phabricator.wikimedia.org/T184933)
[10:19:39] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: upgrade puppet stdlib from 4.22.0 to 4.24.0 [puppet] - 10https://gerrit.wikimedia.org/r/475260 (owner: 10Dzahn)
[10:22:17] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] "jenkins +2ed https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475260/ with this change." [puppet] - 10https://gerrit.wikimedia.org/r/477754 (owner: 10Alexandros Kosiaris)
[10:22:26] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754
[10:22:37] <wikibugs>	 (03PS2) 10Mathew.onipe: setup: change curator version to 4.2.5 to match our current elasticsearch version [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958
[10:24:10] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "rebased on top of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/477754/ jenkins +2s. an extended PCC run also passed so I 'll be " [puppet] - 10https://gerrit.wikimedia.org/r/475260 (owner: 10Dzahn)
[10:27:14] <icinga-wm>	 RECOVERY - puppet last run on an-worker1087 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:29:26] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211311 (10ops-monitoring-bot)
[10:29:55] <banyek>	 !log depooling db1096 for schema change - T85757
[10:29:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:59] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[10:30:18] <wikibugs>	 (03CR) 10Banyek: [C: 032] mariadb: depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477589 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek)
[10:31:22] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477589 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek)
[10:31:50] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211313 (10ops-monitoring-bot)
[10:33:57] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477589 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek)
[10:36:57] <logmsgbot>	 !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: depool db1096:3315 (duration: 00m 49s)
[10:37:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:02] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[10:38:00] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211315 (10ops-monitoring-bot)
[10:41:38] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211316 (10ops-monitoring-bot)
[10:43:15] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211317 (10ops-monitoring-bot)
[10:48:21] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211318 (10ops-monitoring-bot)
[10:50:07] <wikibugs>	 (03PS4) 10Volans: extdist: Switch to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/475579 (https://phabricator.wikimedia.org/T210312) (owner: 10Legoktm)
[10:51:00] <wikibugs>	 (03CR) 10Volans: [C: 032] "Great, merging." [puppet] - 10https://gerrit.wikimedia.org/r/475579 (https://phabricator.wikimedia.org/T210312) (owner: 10Legoktm)
[10:51:02] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211319 (10ops-monitoring-bot)
[10:53:42] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211320 (10ops-monitoring-bot)
[10:55:37] <wikibugs>	 (03PS1) 10MarcoAurelio: Add NS_PROJECT localised name for tt.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477979 (https://phabricator.wikimedia.org/T211312)
[10:56:10] <wikibugs>	 (03PS1) 10Banyek: Revert "mariadb: depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477980
[10:56:16] <banyek>	 !log repooling db1096 for schema change - T85757
[10:56:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:19] <stashbot>	 T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[10:56:46] <volans>	 !log disable event handler on Icinga for ms-be2047 MD Raid and MegaRAID checks, it's spamming Phabricator - T209921
[10:56:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:49] <stashbot>	 T209921: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921
[10:56:50] <volans>	 godog: ^^^
[10:57:01] <volans>	 I'm cleaning the tasks on phab, sorry about that
[10:57:11] <wikibugs>	 (03CR) 10Banyek: [C: 032] Revert "mariadb: depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477980 (owner: 10Banyek)
[10:58:13] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477980 (owner: 10Banyek)
[10:58:21] <wikibugs>	 (03PS2) 10MarcoAurelio: Add NS_PROJECT localised name for tt.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477979 (https://phabricator.wikimedia.org/T211312)
[10:58:50] <wikibugs>	 (03PS1) 10Elukey: profile::cdh::apt: add custom exec for apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/477981
[10:59:49] <logmsgbot>	 !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1096:3315 (duration: 00m 47s)
[10:59:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:57] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477980 (owner: 10Banyek)
[11:01:19] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::cdh::apt: add custom exec for apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/477981 (owner: 10Elukey)
[11:01:26] <wikibugs>	 (03PS2) 10Elukey: profile::cdh::apt: add custom exec for apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/477981
[11:05:11] <wikibugs>	 (03CR) 10GTirloni: prometheus: add directory size collector (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) (owner: 10Cwhite)
[11:08:14] <wikibugs>	 (03CR) 10Mathew.onipe: setup: change curator version to 4.2.5 to match our current elasticsearch version (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 (owner: 10Mathew.onipe)
[11:09:55] <wikibugs>	 (03PS3) 10Mathew.onipe: setup: change curator version to 4.2.5 to match our current elasticsearch version [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958
[11:13:54] <wikibugs>	 (03CR) 10Volans: setup: change curator version to 4.2.5 to match our current elasticsearch version (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 (owner: 10Mathew.onipe)
[11:15:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Add kerberos puppet wrapper (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/477987
[11:16:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add kerberos puppet wrapper (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/477987 (owner: 10Muehlenhoff)
[11:19:26] <wikibugs>	 (03CR) 10Muehlenhoff: prometheus: add directory size collector (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) (owner: 10Cwhite)
[11:20:54] <wikibugs>	 (03PS2) 10Muehlenhoff: Add kerberos puppet wrapper (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/477987
[11:24:10] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211320 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:24:17] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211319 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:24:24] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211318 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:24:30] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211317 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:24:37] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211316 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:25:08] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211315 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:25:14] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211313 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:25:22] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211311 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:25:33] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211310 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:25:38] <wikibugs>	 (03CR) 10Phuedx: EventLogging Logstash filter: move useful fields out of event (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza)
[11:25:39] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211309 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:25:46] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211308 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:25:54] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211307 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:26:02] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211304 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:26:08] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211303 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:26:16] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211302 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:26:22] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211301 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:26:31] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211300 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:26:34] <akosiaris_>	 lol ?
[11:26:38] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211299 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:26:44] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211298 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:26:51] <akosiaris_>	 that many duplicates ?
[11:26:52] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211296 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:26:55] <mobrovac>	 akosiaris_: mass-closing duplicates
[11:26:57] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211293 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:26:59] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211295 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:01] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211292 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:04] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211290 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:06] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211291 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:08] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211289 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:10] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211288 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:12] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211283 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:15] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211286 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:16] <wikibugs>	 (03CR) 10Phuedx: [C: 031] "Including Gergo's review above too." [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza)
[11:27:18] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211285 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:20] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211287 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:22] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211284 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:24] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211282 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:27] <akosiaris_>	 mobrovac: with a script I guess ? 
[11:27:28] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211281 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:30] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211279 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:34] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211278 (10mobrovac) 05Open>03Invalid Duplicate of {T209921}
[11:27:38] <mobrovac>	 akosiaris_: mass-edit functionality of phab
[11:27:52] <akosiaris_>	 what ? TIL 
[11:28:17] <volans>	 akosiaris_: https://www.mediawiki.org/wiki/Phabricator/Help#Batch_edits
[11:28:23] <akosiaris_>	 https://www.mediawiki.org/wiki/Phabricator/Help#Batch_edits
[11:28:25] <volans>	 but you have to be in a specific group 
[11:28:28] <akosiaris_>	 yeah never used that before 
[11:28:33] <volans>	 I couldn't add myself even from admin
[11:28:38] <akosiaris_>	 ahahaha
[11:28:56] <akosiaris_>	 so wait, you can't add yourself, but mobrovac can actually do it ?
[11:28:57] <mobrovac>	 volans: that's called segregation of power
[11:29:05] <mobrovac>	 i'm in the group
[11:29:13] <volans>	 I think it depends on acl*phabricator somehow
[11:31:23] <wikibugs>	 10Operations, 10Security-Team, 10Wikimedia-Site-requests: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10chasemp) Good points @aklapper.  I am not sure if this wording is ours or default.  I am making a note to discuss with #security-team.  One question, I have done som...
[11:31:27] <wikibugs>	 (03CR) 10Phuedx: [C: 031] "You might also consider adding a remove_field filter to strip out the not-very-helpful schema field (which is always "EventError")." [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza)
[11:31:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] "PCC says noop at https://puppet-compiler.wmflabs.org/compiler1002/13850/, so after the dependent change is merged today, we can probably m" [puppet] - 10https://gerrit.wikimedia.org/r/475261 (owner: 10Dzahn)
[11:35:15] <wikibugs>	 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Volans) @Papaul @fgiunchedi Today the RAID alarm was continuously flapping and created a ton of tasks (see above) that I asked mo.brovac to close as he had access to the batch edit interface in Phabricator....
[11:36:07] <wikibugs>	 10Operations, 10User-jijiki: Create a  mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10Joe) I would add another requirement:  - we want all mediawiki cronjobs to only run in the datacenter where mediawiki is active right now.
[11:36:47] <wikibugs>	 10Operations, 10User-jijiki: Create a  mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10jijiki)
[11:36:56] <wikibugs>	 (03PS1) 10Ema: ATS: define one single global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/477990 (https://phabricator.wikimedia.org/T207048)
[11:37:09] <wikibugs>	 10Operations, 10User-jijiki: Create a  mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10jijiki) >>! In T211250#4803144, @Joe wrote: > I would add another requirement: >  > - we want all mediawiki cronjobs to only run in the datacenter where mediawiki is active right now.  Added in descr...
[11:37:49] <wikibugs>	 (03PS3) 10Muehlenhoff: Add kerberos puppet wrapper (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/477987
[11:40:38] <wikibugs>	 (03PS2) 10Ema: ATS: define one single global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/477990 (https://phabricator.wikimedia.org/T207048)
[11:42:23] <wikibugs>	 (03PS1) 10Ema: ATS: do not cache files bigger than 1GB [puppet] - 10https://gerrit.wikimedia.org/r/477991 (https://phabricator.wikimedia.org/T209021)
[11:49:23] <Hauskatze>	 jouncebot: next
[11:49:23] <jouncebot>	 In 0 hour(s) and 10 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T1200)
[11:52:07] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: prometheus: Add a service label for OTRS [puppet] - 10https://gerrit.wikimedia.org/r/385385 (owner: 10Alexandros Kosiaris)
[11:58:36] <wikibugs>	 10Operations, 10Security-Team, 10Wikimedia-Site-requests: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10Aklapper) > Where are people seeing this? Console of the web browser's Developer Tools
[11:59:54] <moritzm>	 !log installing nginx updates on mw canaries
[11:59:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T1200).
[12:00:04] <jouncebot>	 Hauskatze: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:13] <Hauskatze>	 o/
[12:05:32] <Hauskatze>	 anyone for SWAT?
[12:06:27] <wikibugs>	 (03PS3) 10Ema: ATS: define one single global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/477990 (https://phabricator.wikimedia.org/T207048)
[12:07:27] <wikibugs>	 (03CR) 10Ema: [C: 032] ATS: define one single global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/477990 (https://phabricator.wikimedia.org/T207048) (owner: 10Ema)
[12:14:09] <Hauskatze>	 Lucas_WMDE: hi
[12:14:45] <wikibugs>	 (03PS2) 10Ema: ATS: do not cache files bigger than 1GB [puppet] - 10https://gerrit.wikimedia.org/r/477991 (https://phabricator.wikimedia.org/T209021)
[12:16:32] <wikibugs>	 (03CR) 10Ema: [C: 032] ATS: do not cache files bigger than 1GB [puppet] - 10https://gerrit.wikimedia.org/r/477991 (https://phabricator.wikimedia.org/T209021) (owner: 10Ema)
[12:21:30] <Hauskatze>	 No one available for deploying a little patch?
[12:21:58] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (10ema)
[12:22:48] <Lucas_WMDE>	 Hauskatze: not me, sorry, I’m in a meeting
[12:23:01] <Lucas_WMDE>	 (addshore not available either for the same reason)
[12:23:04] <Hauskatze>	 Lucas_WMDE: thanks for replying, at least
[12:23:46] <Hauskatze>	 well, I'll keep waiting :)
[12:23:57] <Lucas_WMDE>	 good luck :)
[12:24:08] <Lucas_WMDE>	 apparently this is a bad week for SWAT, due to some releng offsite :/
[12:24:35] <Hauskatze>	 yup, but Wikitech says SWATs are okay
[12:24:52] <Hauskatze>	 no MW train though
[12:28:28] <wikibugs>	 10Operations: SRE quarterly goal: Ability to serve a fraction of the production traffic from PHP7 - https://phabricator.wikimedia.org/T206336 (10Joe)
[12:28:34] <wikibugs>	 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (10Joe) 05Open>03Resolved
[12:29:41] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 031] "With '-exact',  rebuilding python-thumbor-wikimedia for Debian stretch was successful." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/477796 (https://phabricator.wikimedia.org/T170817) (owner: 10Gilles)
[12:31:43] <wikibugs>	 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10User-ArielGlenn, 10User-Joe: Correctly collect logs from php-fpm pools - https://phabricator.wikimedia.org/T211184 (10Joe)
[12:38:06] <wikibugs>	 10Operations, 10Community-Tech, 10DBA, 10MediaWiki-Database, 10MediaWiki-extensions-PageAssessments: cron spam from mwmaint1002 - https://phabricator.wikimedia.org/T211269 (10Banyek) I'll take a look into this
[12:40:21] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema)
[12:40:24] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (10ema) 05Open>03Resolved All the functionalities currently provided by our varnish backends in terms of request/response mangling have been implemented, with two exceptions:  1. ATS...
[12:42:56] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::hadoop::common: explicitly contain class cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/477995
[12:45:20] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/13852/ at the compiler level it's a noop but it could break puppet on all servers inclidi" [puppet] - 10https://gerrit.wikimedia.org/r/477995 (owner: 10Giuseppe Lavagetto)
[12:47:29] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::hadoop::common: explicitly contain class cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/477995 (owner: 10Giuseppe Lavagetto)
[12:53:15] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 44 probes of 335 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[12:55:43] <moritzm>	 !log installing nginx updates on mw in codfw
[12:55:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:09] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: puppet-merge: avoid warning in numeric equality [puppet] - 10https://gerrit.wikimedia.org/r/477996
[12:56:36] <gehel>	 !log depooling and shutting down elasticsearch on elastic2001-2024 - T211023
[12:56:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:40] <stashbot>	 T211023: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023
[12:58:11] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 13 probes of 335 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[13:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T1300)
[13:00:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - search-https_9243: Servers elastic2031.codfw.wmnet, elastic2042.codfw.wmnet, elastic2041.codfw.wmnet, elastic2033.codfw.wmnet, elastic2029.codfw.wmnet, elastic2053.codfw.wmnet, elastic2027.codfw.wmnet, elastic2026.codfw.wmnet, elastic2038.codfw.wmnet, elastic2039.codfw.wmnet, elastic2054.codfw.wmnet, elastic2035.codfw.wmnet, elastic2037.codf
[13:00:57] <icinga-wm>	 025.codfw.wmnet, elastic2051.codfw.wmnet, elastic2044.codfw.wmnet, elastic2040.codfw.wmnet, elastic2045.codfw.wmnet, elastic2043.codfw.wmnet, elastic2034.codfw.wmnet, elastic2036.codfw.wmnet, elastic2049.codfw.wmnet, elastic2032.codfw.wmnet, elastic2028.codfw.wmnet, elastic2030.codfw.wmnet, elastic2046.codfw.wmnet, elastic2047.codfw.wmnet are marked down but pooled: search_9200: Servers elastic2031.codfw.wmnet, elastic2042.codfw.
[13:00:57] <icinga-wm>	 1.codfw.wmnet, elastic2033.codfw.wmnet, elastic2029.codfw.wmnet, elastic2053.codfw.wmnet, elastic2026.codfw.wmnet, elastic2038.codfw.wmnet, elastic2050.codfw.wmnet, elastic2048.codfw.wm
[13:01:05] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([elastic2031.codfw.wmnet, elastic2042.codfw.wmnet, elastic2027.codfw.wmnet, elastic2033.codfw.wmnet, elastic2029.codfw.wmnet, elastic2053.codfw.wmnet, elastic2040.codfw.wmnet, elastic2038.codfw.wmnet, elastic2050.codfw.wmnet, elastic2045.codfw.wmnet, elastic2039.codfw.wmnet, elastic2048.codfw.wmnet, elastic2054.codfw.wmnet, e
[13:01:05] <icinga-wm>	 wmnet, elastic2037.codfw.wmnet, elastic2043.codfw.wmnet, elastic2051.codfw.wmnet, elastic2052.codfw.wmnet, elastic2044.codfw.wmnet, elastic2026.codfw.wmnet, elastic2041.codfw.wmnet, elastic2025.codfw.wmnet, elastic2034.codfw.wmnet, elastic2036.codfw.wmnet, elastic2049.codfw.wmnet, elastic2032.codfw.wmnet, elastic2028.codfw.wmnet, elastic2030.codfw.wmnet, elastic2046.codfw.wmnet, elastic2047.codfw.wmnet])
[13:01:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - search-https_9243: Servers elastic2031.codfw.wmnet, elastic2042.codfw.wmnet, elastic2041.codfw.wmnet, elastic2033.codfw.wmnet, elastic2029.codfw.wmnet, elastic2053.codfw.wmnet, elastic2027.codfw.wmnet, elastic2026.codfw.wmnet, elastic2050.codfw.wmnet, elastic2048.codfw.wmnet, elastic2054.codfw.wmnet, elastic2035.codfw.wmnet, elastic2037.codf
[13:01:15] <icinga-wm>	 025.codfw.wmnet, elastic2051.codfw.wmnet, elastic2044.codfw.wmnet, elastic2040.codfw.wmnet, elastic2045.codfw.wmnet, elastic2043.codfw.wmnet, elastic2034.codfw.wmnet, elastic2036.codfw.wmnet, elastic2049.codfw.wmnet, elastic2032.codfw.wmnet, elastic2028.codfw.wmnet, elastic2030.codfw.wmnet, elastic2046.codfw.wmnet, elastic2047.codfw.wmnet are marked down but pooled: search_9200: Servers elastic2031.codfw.wmnet, elastic2042.codfw.
[13:01:15] <icinga-wm>	 1.codfw.wmnet, elastic2033.codfw.wmnet, elastic2029.codfw.wmnet, elastic2053.codfw.wmnet, elastic2027.codfw.wmnet, elastic2026.codfw.wmnet, elastic2038.codfw.wmnet, elastic2050.codfw.wm
[13:01:25] <gehel>	 ^ that's me, should be back in a second (and yes, it is a real issue)
[13:02:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy
[13:02:15] <elukey>	 \
[13:02:20] <elukey>	 (sorry typo)
[13:02:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[13:06:19] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal
[13:07:51] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023)
[13:11:11] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 3443 threshold =0.15 breach: status: yellow, number_of_nodes: 53, unassigned_shards: 3346, number_of_pending_tasks: 3147, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3169, task_max_waiting_in_queue_millis: 586849, cluster_name: production-search-codfw, relocating_shards: 0, act
[13:11:11] <icinga-wm>	 t_as_number: 63.7235275524, active_shards: 6048, initializing_shards: 97, number_of_data_nodes: 53, delayed_unassigned_shards: 12
[13:11:25] <_joe_>	 gehel: gat us gaooebubg>
[13:11:34] <_joe_>	 err off by one
[13:11:36] <akosiaris_>	 ?
[13:11:37] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 031] elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel)
[13:11:38] <_joe_>	 what is happening
[13:11:42] <akosiaris_>	 the entire sentence ?
[13:11:43] <akosiaris_>	 lol
[13:11:44] <_joe_>	 gehel: onimisionipe 
[13:12:14] <_joe_>	 pybal is ok but the elasticsearch error?
[13:12:24] <wikibugs>	 (03CR) 10DCausse: [C: 04-1] elasticsearch: move master eligible nodes to new servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel)
[13:12:37] <akosiaris_>	 _joe_ gehel pointed out already it's him 
[13:12:45] <onimisionipe>	 _joe_: gehel is on it...
[13:12:45] <wikibugs>	 (03PS1) 10Elukey: profile::hadoop::worker: move prometheus require before cdh class [puppet] - 10https://gerrit.wikimedia.org/r/477998
[13:12:46] <_joe_>	 akosiaris_: yeah, the pybal errors
[13:12:57] <_joe_>	 but the health check issue is not expected
[13:13:02] <_joe_>	 AFAICT
[13:13:05] <_joe_>	 ok cool
[13:14:31] <gehel>	 for some definition of "expected"
[13:14:32] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::hadoop::worker: move prometheus require before cdh class [puppet] - 10https://gerrit.wikimedia.org/r/477998 (owner: 10Elukey)
[13:14:52] <gehel>	 I messed up the step order in the decom of old elastic servers
[13:15:12] <gehel>	 cluster is recovering already, but some indices are still in error
[13:15:18] <akosiaris_>	 need help ?
[13:15:24] <gehel>	 not a huge deal, this is codfw, no user traffic
[13:15:29] <akosiaris_>	 ok
[13:15:35] <gehel>	 nope, all is good, just need to wait at this point
[13:16:14] * gehel is going to get the 9 cat tails and flog himself as penitence
[13:17:22] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 2417 threshold =0.15 breach: status: yellow, number_of_nodes: 54, unassigned_shards: 2334, number_of_pending_tasks: 386, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3169, task_max_waiting_in_queue_millis: 366680, cluster_name: production-search-codfw, relocating_shards:
[13:17:23] <icinga-wm>	 _percent_as_number: 74.5337688336, active_shards: 7074, initializing_shards: 83, number_of_data_nodes: 54, delayed_unassigned_shards: 0 Gehel cluster recovering, should be green in a few minutes
[13:20:40] <wikibugs>	 10Operations, 10Community-Tech, 10DBA, 10MediaWiki-Database, 10MediaWiki-extensions-PageAssessments: cron spam from mwmaint1002 - https://phabricator.wikimedia.org/T211269 (10Banyek)
[13:22:40] <apergos>	 there is a lot of elasticsearch cronspam still coming in, is that expected?
[13:22:46] <apergos>	 gehel: 
[13:22:55] <volans>	 apergos: yes, see backlog
[13:23:11] <dcausse>	 apergos: something with hotthreads?
[13:23:20] <gehel>	 oh, yeah forgot about those as well
[13:23:39] <gehel>	 expected, not an issue except for the spam itself
[13:23:41] <gehel>	 on it
[13:23:59] <apergos>	 thank you!
[13:25:47] <wikibugs>	 10Operations, 10Datasets-General-or-Unknown, 10WMDE-Analytics-Engineering, 10Wikidata: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739 (10Addshore) For reference for any future people looking at this, this is currently used by:  - https://github.com/wikimedia/puppet/blob/...
[13:26:36] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: status: yellow, number_of_nodes: 54, unassigned_shards: 1352, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3169, task_max_waiting_in_queue_millis: 924614, cluster_name: production-search-codfw, relocating_shards: 0, active_shards_percent_as_numbe
[13:26:36] <icinga-wm>	  active_shards: 8084, initializing_shards: 55, number_of_data_nodes: 54, delayed_unassigned_shards: 0
[13:27:25] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: ignore output of hot thread cron [puppet] - 10https://gerrit.wikimedia.org/r/477999
[13:31:02] <wikibugs>	 (03PS2) 10Gehel: elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023)
[13:32:33] <wikibugs>	 (03CR) 10Gehel: [C: 032] elasticsearch: ignore output of hot thread cron [puppet] - 10https://gerrit.wikimedia.org/r/477999 (owner: 10Gehel)
[13:33:00] <wikibugs>	 10Operations, 10Community-Tech, 10DBA, 10MediaWiki-Database, 10MediaWiki-extensions-PageAssessments: cron spam from mwmaint1002 - https://phabricator.wikimedia.org/T211269 (10Banyek) I checked the query `SELECT /* Wikimedia\Rdbms\Database::select www-data@mwmain... */  DISTINCT( pa_project_id )  FROM `pa...
[13:36:45] <akosiaris_>	 on the move, bbl
[13:37:38] <wikibugs>	 (03CR) 10DCausse: [C: 04-1] elasticsearch: move master eligible nodes to new servers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel)
[13:38:09] <wikibugs>	 (03PS1) 10Ema: ATS: support Collapsed Forwarding [puppet] - 10https://gerrit.wikimedia.org/r/478003 (https://phabricator.wikimedia.org/T207048)
[13:41:37] <wikibugs>	 (03CR) 10DCausse: [C: 04-1] elasticsearch: move master eligible nodes to new servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel)
[13:43:43] <wikibugs>	 (03PS3) 10Gehel: elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023)
[13:46:50] <moritzm>	 !log upgrading spamassassin on mx2001
[13:46:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:46] <wikibugs>	 (03CR) 10DCausse: [C: 031] elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel)
[13:49:02] <wikibugs>	 (03PS4) 10Gehel: elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023)
[13:50:00] <wikibugs>	 (03CR) 10Gehel: [C: 032] elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel)
[13:54:29] <wikibugs>	 10Operations, 10Community-Tech, 10DBA, 10MediaWiki-Database, 10MediaWiki-extensions-PageAssessments: cron spam from mwmaint1002 - https://phabricator.wikimedia.org/T211269 (10Banyek) 05Open>03Invalid This is a duplicate of T208231
[13:55:01] <wikibugs>	 10Operations, 10DBA: Issues with purgeUnusedProjects.php cron job on mwmaint1002  (Fri Oct 26) - https://phabricator.wikimedia.org/T208231 (10Banyek)
[13:59:40] <icinga-wm>	 PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:00:19] <volans>	 gehel: elasticsearch_5@production-search-omega-codfw.service failed
[14:00:30] <gehel>	 yep, looking
[14:00:56] * volans a bit surprised to have a codfw service in eqiad ;)
[14:01:13] <gehel>	 it shouldn't be there
[14:01:17] <gehel>	 obviously
[14:01:30] <gehel>	 templated systemd unit
[14:01:37] <volans>	 got it
[14:01:43] * gehel is looking through his history to see what he did wrong
[14:03:06] <icinga-wm>	 RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational
[14:13:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove Diamond from DNS roles [puppet] - 10https://gerrit.wikimedia.org/r/478016 (https://phabricator.wikimedia.org/T183454)
[14:16:51] <moritzm>	 !log installing nginx security updates on mw in eqiad
[14:16:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:52] <wikibugs>	 10Operations, 10DBA: Issues with purgeUnusedProjects.php cron job on mwmaint1002  (Fri Oct 26) - https://phabricator.wikimedia.org/T208231 (10Banyek)
[14:22:08] <wikibugs>	 10Operations, 10DBA, 10User-Banyek: Issues with purgeUnusedProjects.php cron job on mwmaint1002  (Fri Oct 26) - https://phabricator.wikimedia.org/T208231 (10Banyek)
[14:25:38] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, 10Multimedia: Cron <www-data@mwmaint1002> /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T211270 (10jijiki) 05Open>03Invalid Duplicate of T150375
[14:26:33] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, 10Multimedia: Cron <www-data@mwmaint1002> /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T211270 (10Zoranzoki21) 05Invalid>03Open
[14:26:49] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, 10Multimedia: Cron <www-data@mwmaint1002> /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T211270 (10Zoranzoki21)
[14:26:57] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, and 3 others: cronspam cleanup: Cron <www-data@terbium> /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T150375 (10Zoranzoki21)
[14:28:54] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, and 3 others: cronspam cleanup: Cron <www-data@terbium> /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T150375 (10jijiki) @Aklapper We have resolved it on IR...
[14:36:32] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-merge: avoid warning in numeric equality [puppet] - 10https://gerrit.wikimedia.org/r/477996 (owner: 10Giuseppe Lavagetto)
[14:36:44] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: puppet-merge: avoid warning in numeric equality [puppet] - 10https://gerrit.wikimedia.org/r/477996
[14:37:14] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10cloud-services-team: Degraded RAID on cloudvirtan1001 - https://phabricator.wikimedia.org/T211235 (10jijiki) p:05Triage>03High
[14:37:50] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10cloud-services-team: Degraded RAID on cloudvirtan1001 - https://phabricator.wikimedia.org/T211235 (10jijiki) The alert from icinga is gone, close this if you believe everything is ok :)
[14:46:39] <moritzm>	 !log uploaded nodejs 10.4.0~dfsg-1+wmf2 to apt.wikimedia.org/component/node10 (backports of recent security fixes)
[14:46:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:53] <elukey>	 moritzm: good if I move turnilo to node 10?
[14:51:34] <wikibugs>	 (03PS1) 10Elukey: Move remaining stat1005 references to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/478020 (https://phabricator.wikimedia.org/T205846)
[14:51:49] <wikibugs>	 10Operations, 10Wikimedia-Logstash: Procure and provision Logging pipeline hardware in multiple datacenters - https://phabricator.wikimedia.org/T205850 (10herron)
[14:51:52] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install codfw logstash elasticsearch storage servers - https://phabricator.wikimedia.org/T211065 (10herron)
[14:53:13] <moritzm>	 elukey: let's go ahead!
[14:56:36] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754
[14:56:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754 (owner: 10Alexandros Kosiaris)
[14:59:36] <logmsgbot>	 !log elukey@deploy1001 Started deploy [analytics/turnilo/deploy@6bd6e2f]: upgrade deps to nodejs 10
[14:59:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:44] <logmsgbot>	 !log elukey@deploy1001 Finished deploy [analytics/turnilo/deploy@6bd6e2f]: upgrade deps to nodejs 10 (duration: 00m 09s)
[14:59:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:05] <elukey>	 moritzm: done!
[15:01:00] <moritzm>	 elukey: congratulations for being the first non-k8s service to migrate :-)
[15:01:08] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: upgrade puppet stdlib from 4.22.0 to 4.24.0 [puppet] - 10https://gerrit.wikimedia.org/r/475260 (owner: 10Dzahn)
[15:01:14] <elukey>	 \o/
[15:01:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] upgrade puppet stdlib from 4.22.0 to 4.24.0 [puppet] - 10https://gerrit.wikimedia.org/r/475260 (owner: 10Dzahn)
[15:01:29] <wikibugs>	 (03PS2) 10Muehlenhoff: Install Imagemagick policy files for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/476818
[15:04:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Install Imagemagick policy files for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/476818 (owner: 10Muehlenhoff)
[15:05:11] <gehel>	 !log upgrade nginx on wdqs servers
[15:05:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:22] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Hotfix for logging in php-fpm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478021 (https://phabricator.wikimedia.org/T211184)
[15:08:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Hotfix for logging in php-fpm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478021 (https://phabricator.wikimedia.org/T211184) (owner: 10Giuseppe Lavagetto)
[15:08:50] <gehel>	 !log restarting new elasticsearch masters on codfw - T211023
[15:08:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:53] <stashbot>	 T211023: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023
[15:09:02] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: upgrade puppet stdlib from 4.22.0 to 4.24.0 [puppet] - 10https://gerrit.wikimedia.org/r/475260 (owner: 10Dzahn)
[15:09:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] upgrade puppet stdlib from 4.22.0 to 4.24.0 [puppet] - 10https://gerrit.wikimedia.org/r/475260 (owner: 10Dzahn)
[15:09:48] <icinga-wm>	 PROBLEM - puppet last run on analytics1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:10:10] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: upgrade puppet stdlib from 4.24.0 to 4.25.1 [puppet] - 10https://gerrit.wikimedia.org/r/475261 (owner: 10Dzahn)
[15:12:07] <elukey>	 Dec  6 15:04:43 analytics1066 puppet-agent[15161]: Could not retrieve catalog from remote server: Error 503 on SERVER: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
[15:12:44] <elukey>	 maybe it's a temp glitch due to the stdlib migration?
[15:14:57] <elukey>	 ah yes a puppet run fixes it
[15:15:00] <icinga-wm>	 RECOVERY - puppet last run on analytics1066 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[15:26:50] <akosiaris>	 it would be
[15:26:54] <akosiaris>	 for 503 with HTML ?
[15:27:00] <akosiaris>	 it could* be
[15:27:26] <elukey>	 no idea, after a manual puppet run the new stdlib was deployed and everything went to normal
[15:27:41] <akosiaris>	 there is a race condition when merging changes for files that get added/removed
[15:28:10] <akosiaris>	 it's between the catalog actually being compiled and being applied
[15:28:22] <akosiaris>	 it might very well be a catalog with the old change
[15:28:53] <akosiaris>	 and the new revision might no longer have the files mentioned in the catalog (or have different versions)
[15:29:13] <wikibugs>	 (03PS1) 10Elukey: profile::statistics::private: allow labsdb to push nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/478022 (https://phabricator.wikimedia.org/T211330)
[15:29:21] <akosiaris>	 it coalesces pretty quickly
[15:29:40] <akosiaris>	 the entire window is probably something like 3-5 secs
[15:32:48] <wikibugs>	 (03PS2) 10Elukey: profile::statistics::private: allow labsdb to push nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/478022 (https://phabricator.wikimedia.org/T211330)
[15:32:58] <wikibugs>	 (03CR) 10ArielGlenn: [C: 031] "The labstore/dumps related pieces of this look fine to me, as long as all the rest of stat1005's functionality has also been moved over." [puppet] - 10https://gerrit.wikimedia.org/r/478020 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey)
[15:38:40] <moritzm>	 !log uploaded nodejs 6.11~dfsg-1+wmf5 for stretch-wikimedia (the upstream patch for CVE-2018-12122 had a regression, this update fixes it)
[15:38:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:02] <_joe_>	 akosiaris: are you done with stdlib?
[15:39:25] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: wmflib: make the role() function store a path in $::_role [puppet] - 10https://gerrit.wikimedia.org/r/475498
[15:39:26] <_joe_>	 I want to merge my hiera changes 
[15:39:33] <_joe_>	 at least the first two
[15:40:59] <akosiaris>	 _joe_: yes
[15:41:10] <akosiaris>	 the next version bump is for Monday
[15:42:13] <_joe_>	 ack
[15:42:30] <_joe_>	 !log disabling puppet fleet-wide for a change in the role() function
[15:42:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib: make the role() function store a path in $::_role [puppet] - 10https://gerrit.wikimedia.org/r/475498 (owner: 10Giuseppe Lavagetto)
[15:42:51] <fsero>	 !log modifying zotero deploy CLUSTER=codfw scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero - T211322
[15:42:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:55] <stashbot>	 T211322: zotero pods on codfw should use codfw url downloader - https://phabricator.wikimedia.org/T211322
[15:45:01] <logmsgbot>	 !log fsero@deploy1001 scap-helm zotero upgrade production -f ../zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw]
[15:45:02] <logmsgbot>	 !log fsero@deploy1001 scap-helm zotero cluster codfw completed
[15:45:02] <logmsgbot>	 !log fsero@deploy1001 scap-helm zotero finished
[15:45:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:54] <wikibugs>	 (03CR) 10ArielGlenn: [C: 031] "Seems ok to me, though if this temporary solution turns into a long term one, we don't really need to give everything in $statistics_serve" [puppet] - 10https://gerrit.wikimedia.org/r/478022 (https://phabricator.wikimedia.org/T211330) (owner: 10Elukey)
[15:51:20] <moritzm>	 !log upgrading spamassassin on mx1001/fermium
[15:51:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:18] <icinga-wm>	 PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 141.7 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[16:02:05] <wikibugs>	 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Andy.Johnson@dell.com   9:40 AM (21 minutes ago)   to me, faidon  Dell Customer Communication     Here is a link to the Dell Support Live Image (SLI) Version 3.0. with this we can  test the hardware...
[16:12:49] <wikibugs>	 (03PS2) 10Herron: logstash: ship kafka server logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788)
[16:17:27] <gehel>	 !log shutting down elasticsearch on elastic2001-2024 (second try) - T211023
[16:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:31] <stashbot>	 T211023: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023
[16:21:35] <wikibugs>	 10Operations, 10Phabricator: Switch PHP-FPM on phab1002 - https://phabricator.wikimedia.org/T211353 (10Paladox)
[16:23:07] <wikibugs>	 (03PS3) 10Gehel: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe)
[16:27:17] <urandom>	 !log decommissioning cassandra-a, restbase2001 -- T210843
[16:27:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:21] <stashbot>	 T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843
[16:39:26] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Banyek) @Ottomata Yes, I don't see anything against this. Just make sure that the data is copied over a secure channel and get removed both the export...
[16:44:49] <wikibugs>	 (03CR) 10DCausse: [C: 04-1] elasticsearch: Remove elastic2001-elastic2024 from codfw cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe)
[16:45:42] <wikibugs>	 (03PS1) 10Andrew Bogott: no-op patch for tox testing purposes [software/cumin] - 10https://gerrit.wikimedia.org/r/478026
[16:56:33] <wikibugs>	 (03PS1) 10Bmansurov: Labs: enable reader trust survey on enwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478028 (https://phabricator.wikimedia.org/T209882)
[16:57:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Labs: enable reader trust survey on enwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478028 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov)
[16:58:01] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: purge main elasticsearch configuration directory [puppet] - 10https://gerrit.wikimedia.org/r/478029 (https://phabricator.wikimedia.org/T211023)
[17:00:04] <jouncebot>	 godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T1700).
[17:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[17:02:42] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Gehel) We need to reassign some nodes between the psi and omega cluster, as removing old nodes would leave the clusters unbalanced between rows.  This will require...
[17:05:01] <wikibugs>	 10Operations, 10DBA, 10User-Banyek: Issues with purgeUnusedProjects.php cron job on mwmaint1002  (Fri Oct 26) - https://phabricator.wikimedia.org/T208231 (10Banyek) i'd like to add the owner of the script as a subscriber, but I don't know how to find who is it
[17:05:34] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 031] "Are there already entries in the database that must be updated accordingly? Please make sure this happens the same time this is merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477856 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm)
[17:08:25] <wikibugs>	 (03PS2) 10Bmansurov: Labs: enable reader trust survey on enwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478028 (https://phabricator.wikimedia.org/T209882)
[17:09:26] <wikibugs>	 (03CR) 10Gehel: "PCC looks reasonable (https://puppet-compiler.wmflabs.org/compiler1002/13854/), checking a few servers (both elastic and logstash) it look" [puppet] - 10https://gerrit.wikimedia.org/r/478029 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel)
[17:09:47] <wikibugs>	 (03PS1) 10Volans: Add ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/478030 (https://phabricator.wikimedia.org/T205884)
[17:11:46] <moritzm>	 !log uploaded nodejs 6.11~dfsg-1+wmf5 for jessie-wikimedia (the upstream patch for CVE-2018-12122 had a regression, this update fixes it)
[17:11:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:03] <wikibugs>	 (03PS7) 10Paladox: profile::phabricator::httpd: Fix worker configs and also use hiera value [puppet] - 10https://gerrit.wikimedia.org/r/477925
[17:15:38] <wikibugs>	 (03PS4) 10Paladox: httpd::mpm: Add php7.0 and php7.2 [puppet] - 10https://gerrit.wikimedia.org/r/477587
[17:15:40] <wikibugs>	 (03PS6) 10Paladox: phabricator: Increase 'max_execution_time' to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595
[17:16:59] <wikibugs>	 (03PS1) 10Paladox: phabricator: Enable php-fpm on phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/478032
[17:17:28] <wikibugs>	 (03PS2) 10Paladox: phabricator: Enable php-fpm on phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/478032
[17:17:54] <wikibugs>	 (03PS3) 10Paladox: phabricator: Enable php-fpm on phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353)
[17:17:56] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353) (owner: 10Paladox)
[17:18:15] <wikibugs>	 (03CR) 10DCausse: [C: 031] elasticsearch: purge main elasticsearch configuration directory [puppet] - 10https://gerrit.wikimedia.org/r/478029 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel)
[17:31:44] <moritzm>	 !log installing nodejs updates on proton*
[17:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:23] <wikibugs>	 (03PS1) 10Shreyasminocha: Add HD logos for 3 projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478034
[17:45:32] <wikibugs>	 (03PS1) 10Shreyasminocha: Update settings to include new HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478036
[17:46:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Update settings to include new HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478036 (owner: 10Shreyasminocha)
[18:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T1800).
[18:00:18] <subbu>	 no parsoid deploy today
[18:13:54] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s8 on db1124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.pagelinks: Can't find record in pagelinks, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the event's master log db1087-bin.003545, end_log_pos 943920774
[18:17:11] <logmsgbot>	 !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@1dba3cd]: Internally promisify page processing steps (T202642)
[18:17:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:15] <stashbot>	 T202642: Investigate how to fix the performance problems caused by CPU bound work on the MCS services - https://phabricator.wikimedia.org/T202642
[18:18:09] <wikibugs>	 (03CR) 10Elukey: "@Bstorm: any concern from the labsdb side?" [puppet] - 10https://gerrit.wikimedia.org/r/478020 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey)
[18:18:25] <wikibugs>	 (03CR) 10Elukey: "s/labsdb/labstore/" [puppet] - 10https://gerrit.wikimedia.org/r/478020 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey)
[18:19:09] <wikibugs>	 10Operations, 10Phabricator, 10Patch-For-Review: Switch PHP-FPM on phab1002 - https://phabricator.wikimedia.org/T211353 (10Dzahn) a:03Dzahn
[18:19:18] <wikibugs>	 10Operations, 10Phabricator, 10Patch-For-Review: Switch PHP-FPM on phab1002 - https://phabricator.wikimedia.org/T211353 (10Dzahn) p:05Triage>03Normal
[18:21:01] <wikibugs>	 (03CR) 10Gehel: [C: 032] elasticsearch: purge main elasticsearch configuration directory [puppet] - 10https://gerrit.wikimedia.org/r/478029 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel)
[18:21:04] <logmsgbot>	 !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@1dba3cd]: Internally promisify page processing steps (T202642) (duration: 03m 54s)
[18:21:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:36] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s8 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 765.51 seconds
[18:29:22] <wikibugs>	 (03PS1) 10Bstorm: gridengine: simplifying the config and making it more "normal" for grid [puppet] - 10https://gerrit.wikimedia.org/r/478041 (https://phabricator.wikimedia.org/T211258)
[18:31:05] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) ehm. i spent time on puppetizing this to make sure bmansurov's import script gets installed in a way that doesn't conflict with server access po...
[18:31:15] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Gehel) a:03Gehel
[18:32:18] <wikibugs>	 (03CR) 10Bstorm: [C: 032] gridengine: simplifying the config and making it more "normal" for grid [puppet] - 10https://gerrit.wikimedia.org/r/478041 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm)
[18:34:54] <wikibugs>	 (03PS2) 10Dzahn: Partman: Add logstash200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/477923 (https://phabricator.wikimedia.org/T211065) (owner: 10Papaul)
[18:36:48] <wikibugs>	 (03CR) 10Dzahn: [C: 032] Partman: Add logstash200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/477923 (https://phabricator.wikimedia.org/T211065) (owner: 10Papaul)
[18:41:41] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Set FileImporter config help location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477798 (https://phabricator.wikimedia.org/T199108) (owner: 10WMDE-Fisch)
[18:42:44] <wikibugs>	 (03PS2) 10Dzahn: DHCP: Add MAC address entries for logstash200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/477911 (https://phabricator.wikimedia.org/T211065) (owner: 10Papaul)
[18:46:15] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "for the ones wondering about the different vendor ID in that one MAC." [puppet] - 10https://gerrit.wikimedia.org/r/477911 (https://phabricator.wikimedia.org/T211065) (owner: 10Papaul)
[18:47:41] <wikibugs>	 (03PS2) 10Dzahn: DNS: Add production and mgmt DNS entries for logstash200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/477868 (https://phabricator.wikimedia.org/T211065) (owner: 10Papaul)
[18:51:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10faidon)
[18:53:00] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10faidon) Per @bd808 on IRC:  labstore1003 is still in use, blocked by T209527. labstore100[12] are not in use at the mom...
[18:53:47] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "also checked matching WMF numbers in netbox" [dns] - 10https://gerrit.wikimedia.org/r/477868 (https://phabricator.wikimedia.org/T211065) (owner: 10Papaul)
[18:57:28] <wikibugs>	 (03PS4) 10Gehel: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe)
[18:57:40] <wikibugs>	 (03CR) 10Gehel: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe)
[19:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T1900).
[19:00:04] <jouncebot>	 bmansurov: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:11] <bmansurov>	 here
[19:04:55] <bmansurov>	 Who's deploying?
[19:07:10] <bmansurov>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof can anyone deploy?
[19:07:50] <RoanKattouw>	 I can do it
[19:08:00] <bmansurov>	 thanks
[19:08:52] <wikibugs>	 10Operations, 10ops-eqsin: update PDUs for eqsin (asset tag and other info) - https://phabricator.wikimedia.org/T211368 (10RobH) p:05Triage>03Low
[19:09:10] <wikibugs>	 (03CR) 10Catrope: [C: 032] Labs: enable reader trust survey on enwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478028 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov)
[19:10:15] <wikibugs>	 (03Merged) 10jenkins-bot: Labs: enable reader trust survey on enwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478028 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov)
[19:10:54] <wikibugs>	 (03PS4) 10Cwhite: prometheus: add directory size collector [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094)
[19:11:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: add directory size collector [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) (owner: 10Cwhite)
[19:14:39] <wikibugs>	 (03PS5) 10Cwhite: prometheus: add directory size collector [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094)
[19:18:21] <wikibugs>	 (03CR) 10jenkins-bot: Labs: enable reader trust survey on enwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478028 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov)
[19:23:49] <wikibugs>	 (03CR) 10DCausse: [C: 031] elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe)
[19:26:29] <wikibugs>	 (03PS5) 10Gehel: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe)
[19:31:44] <Hauskatze>	 jouncebot: next
[19:31:45] <jouncebot>	 In 4 hour(s) and 28 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181207T0000)
[19:31:54] <wikibugs>	 (03CR) 10Cwhite: prometheus: add directory size collector (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) (owner: 10Cwhite)
[19:32:31] <wikibugs>	 (03CR) 10Cwhite: [C: 031] Remove Diamond from DNS roles [puppet] - 10https://gerrit.wikimedia.org/r/478016 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[19:32:53] <gehel>	 !log shutting down elasticsearch on elastic2001-2024 (third time is a charm) - T211023
[19:32:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:57] <stashbot>	 T211023: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023
[19:33:09] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10cloud-services-team: Degraded RAID on cloudvirtan1001 - https://phabricator.wikimedia.org/T211235 (10Ottomata) 05Open>03Resolved a:03Ottomata Assuming this was caused by @andrewbogott reformatting the hosts.  Closing.
[19:35:46] <wikibugs>	 (03CR) 10Ottomata: [C: 031] profile::statistics::private: allow labsdb to push nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/478022 (https://phabricator.wikimedia.org/T211330) (owner: 10Elukey)
[19:35:57] <wikibugs>	 (03CR) 10Gehel: [C: 032] elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe)
[19:36:51] <bmansurov>	 RoanKattouw: thanks for deploying. The change's working.
[19:37:03] <wikibugs>	 (03PS2) 10Cwhite: wmcs: add prometheus-memcached-exporter [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326)
[19:38:30] <wikibugs>	 (03CR) 10Cwhite: "> I think there's been some general confusion where the metrics are" [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) (owner: 10Cwhite)
[19:39:58] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "ack, 10 seemed too aggressive, 30 seems more default. this only influences future stretch system, not prod" [puppet] - 10https://gerrit.wikimedia.org/r/477595 (owner: 10Paladox)
[19:40:29] <wikibugs>	 (03PS7) 10Dzahn: phabricator: Increase 'max_execution_time' to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595 (owner: 10Paladox)
[19:41:07] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Ottomata) Yah thanks for that @dzahn!  The problem is a larger one: how should people get data out of analytics systems for production usage.  The sear...
[19:42:10] <wikibugs>	 (03PS1) 10Herron: rsyslog: increase omkafka timeout to 10s [puppet] - 10https://gerrit.wikimedia.org/r/478045 (https://phabricator.wikimedia.org/T206633)
[19:43:41] <wikibugs>	 (03CR) 10Ottomata: "I think we should keep schema.  If for some reason we ever put other types of events in here, it will be nice to be able to filter them." [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza)
[19:44:10] <icinga-wm>	 PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:44:53] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RStallman-legalteam) To update: switched to MOU & NDA which are now signed and filed with lega...
[19:45:39] <wikibugs>	 (03PS2) 10Herron: rsyslog: increase omkafka timeout to 10s [puppet] - 10https://gerrit.wikimedia.org/r/478045 (https://phabricator.wikimedia.org/T206633)
[19:46:42] <wikibugs>	 (03CR) 10Herron: [C: 032] rsyslog: increase omkafka timeout to 10s [puppet] - 10https://gerrit.wikimedia.org/r/478045 (https://phabricator.wikimedia.org/T206633) (owner: 10Herron)
[19:47:24] <wikibugs>	 (03Abandoned) 10Andrew Bogott: no-op patch for tox testing purposes [software/cumin] - 10https://gerrit.wikimedia.org/r/478026 (owner: 10Andrew Bogott)
[19:47:52] <wikibugs>	 (03PS4) 10Andrew Bogott: Openstack: support multiple regions [software/cumin] - 10https://gerrit.wikimedia.org/r/477811 (https://phabricator.wikimedia.org/T208861)
[19:50:35] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:50:35] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:50:55] <wikibugs>	 (03PS8) 10Dzahn: phabricator: Increase 'max_execution_time' to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595 (owner: 10Paladox)
[19:51:28] <mutante>	 herron: ^ do you know why these are already alerting?
[19:51:41] <mutante>	 i know they are brandnew, but this should only happen with a puppet role
[19:51:47] <icinga-wm>	 PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:52:12] <herron>	 hmm
[19:52:21] <mutante>	 oh, sorry, wrong machines :)
[19:52:25] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:52:25] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[19:52:40] <wikibugs>	 (03PS5) 10Paladox: httpd::mpm: Add php7.0 and php7.2 [puppet] - 10https://gerrit.wikimedia.org/r/477587
[19:53:12] <mutante>	 herron: nevermind, they are not what i thought they were. already from https://phabricator.wikimedia.org/T154251
[19:53:15] <gehel>	 mutante: elastic is me
[19:53:21] <mutante>	 gehel: ok :) thanks
[19:53:49] <gehel>	 I missed a few downtimes, but nothing to worry about so far
[19:53:54] <herron>	 kk
[19:53:56] <mutante>	 ok
[19:54:08] <gehel>	 cleanup coming up
[19:57:52] <wikibugs>	 (03CR) 10Cwhite: [C: 032] initial commit (035 comments) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066) (owner: 10Cwhite)
[19:59:55] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: force deletion of unmanaged resources in /etc/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/478048
[20:00:28] <wikibugs>	 (03PS2) 10Gehel: elasticsearch: force deletion of unmanaged resources in /etc/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/478048
[20:00:35] <wikibugs>	 (03CR) 10Dzahn: profile::phabricator::httpd: Fix worker configs and also use hiera value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477925 (owner: 10Paladox)
[20:01:26] <wikibugs>	 (03CR) 10DCausse: [C: 031] elasticsearch: force deletion of unmanaged resources in /etc/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/478048 (owner: 10Gehel)
[20:01:44] <wikibugs>	 (03CR) 10Gehel: [C: 032] elasticsearch: force deletion of unmanaged resources in /etc/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/478048 (owner: 10Gehel)
[20:04:48] <wikibugs>	 (03PS8) 10Paladox: profile::phabricator::httpd: Use hiera value [puppet] - 10https://gerrit.wikimedia.org/r/477925
[20:05:52] <wikibugs>	 (03PS2) 10Shreyasminocha: Update settings to include new HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478036
[20:07:16] <wikibugs>	 (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478034 (owner: 10Shreyasminocha)
[20:07:21] <wikibugs>	 (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478036 (owner: 10Shreyasminocha)
[20:11:16] <wikibugs>	 (03PS1) 10Paladox: profile::phabricator::httpd: Update's worker config to match MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/478052
[20:11:31] <wikibugs>	 (03PS9) 10Dzahn: phabricator: Use hiera value for phabricator_enable_php_fpm [puppet] - 10https://gerrit.wikimedia.org/r/477925 (owner: 10Paladox)
[20:11:38] <wikibugs>	 (03PS10) 10Dzahn: phabricator: Use hiera value for phabricator_enable_php_fpm [puppet] - 10https://gerrit.wikimedia.org/r/477925 (owner: 10Paladox)
[20:11:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::phabricator::httpd: Update's worker config to match MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox)
[20:15:15] <icinga-wm>	 RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational
[20:15:35] <icinga-wm>	 RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational
[20:15:53] <wikibugs>	 10Operations, 10ops-eqiad, 10media-storage, 10Patch-For-Review: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (10Cmjohnson)
[20:16:24] <wikibugs>	 10Operations, 10ops-eqiad, 10media-storage, 10Patch-For-Review: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (10Cmjohnson) @robh this is already assigned to you but these are ready for you to take over
[20:16:32] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for production-search-omega-codfw on elastic2025 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1719 days)
[20:16:32] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for production-search-omega-codfw on elastic2034 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1719 days)
[20:17:41] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "noop https://puppet-compiler.wmflabs.org/compiler1002/13856/" [puppet] - 10https://gerrit.wikimedia.org/r/477925 (owner: 10Paladox)
[20:19:57] <wikibugs>	 (03CR) 10Dzahn: "looks more like "remove mod_php" than adding something. please edit the commit message a bit to explain why this is needed" [puppet] - 10https://gerrit.wikimedia.org/r/477587 (owner: 10Paladox)
[20:21:00] <wikibugs>	 (03PS6) 10Paladox: httpd::mpm: Add php7.0 and php7.2 [puppet] - 10https://gerrit.wikimedia.org/r/477587
[20:21:09] <wikibugs>	 (03PS3) 10Stella: Add several HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477960 (https://phabricator.wikimedia.org/T150618)
[20:21:32] <wikibugs>	 (03CR) 10Dzahn: "i think i prefer we first do the switch to phab1002 as production host without doing this and then do this in a second step" [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353) (owner: 10Paladox)
[20:21:59] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @GTirloni This has been an ongoing thing since August, I have replaced the battery 3 maybe 4 times already.   Replaced the raid controller once and replaced 4 SS...
[20:23:58] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson) @marostegui and all,  the system board that was replaced yesterday was faulty. Showing errors on DIMM slots B4 and B1.  After swapping DIMMs in B with DIMMs in A, the error remained B4...
[20:25:28] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for production-search-omega-codfw on elastic2052 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1719 days)
[20:27:18] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for production-search-omega-codfw on elastic2031 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1719 days)
[20:27:18] <wikibugs>	 (03PS3) 10Herron: logstash: ship kafka server logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788)
[20:28:19] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-Services, 10DC-Ops: labvirt1018 -> cloudvirt1018: update physical label, network port description, netbox - https://phabricator.wikimedia.org/T207319 (10Cmjohnson) 05Open>03Resolved
[20:28:34] <wikibugs>	 (03CR) 10Herron: [C: 032] logstash: ship kafka server logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788) (owner: 10Herron)
[20:29:24] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Gehel)
[20:29:34] <wikibugs>	 (03PS1) 10Stella: Updated InitialiseSettings.php for HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478055 (https://phabricator.wikimedia.org/T150618)
[20:30:44] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Gehel) a:05Gehel>03RobH elastic2001-2024 are ready for decommission. They are taken out of the cluster and can be shut down whenever you want (cc @Papaul)
[20:31:24] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Andrew) @Cmjohnson am I correct in understanding that cloudvirt1020 has the exact same issue?  Or has that been resolved somehow?
[20:31:38] <wikibugs>	 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10danstillman) > Have you looked at Domino  We looked at Domino briefly and found some [alarming parsing probl...
[20:33:57] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) Ok, thanks @Ottomata feel free to just hit "restore" on that and apply it on another host once we get there.
[20:34:34] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @andrew yes, you are correct it is the same exact issue. My goal was to work with one, figure out the issue and then go to HPE with a solution but that obviously...
[20:36:19] <wikibugs>	 (03PS2) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052
[20:36:24] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox)
[20:39:47] <wikibugs>	 (03PS1) 10GTirloni: cloudvirt1019: reimage with Stretch [puppet] - 10https://gerrit.wikimedia.org/r/478058 (https://phabricator.wikimedia.org/T196507)
[20:41:12] <wikibugs>	 (03PS3) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052
[20:41:17] <wikibugs>	 (03CR) 10GTirloni: [C: 032] cloudvirt1019: reimage with Stretch [puppet] - 10https://gerrit.wikimedia.org/r/478058 (https://phabricator.wikimedia.org/T196507) (owner: 10GTirloni)
[20:41:19] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox)
[20:42:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox)
[20:42:52] <icinga-wm>	 PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.5277 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[20:42:54] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) >>! In T208622#4804221, @Ottomata wrote: > YI think it will involve custom and locked down rsync modules, but we need to puppetize that somehow...
[20:42:54] <wikibugs>	 (03PS4) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052
[20:42:58] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox)
[20:45:05] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Andrew) >>! In T196507#4804320, @Cmjohnson wrote: > @andrew yes, you are correct it is the same exact issue. My goal was to work with one, figure out...
[20:45:36] <XioNoX>	 !log remove codfw/eqdfw avoid path - T194542
[20:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:55] <wikibugs>	 (03PS7) 10Paladox: httpd::mpm: Also remove php7.0 and php7.2 if not prefork [puppet] - 10https://gerrit.wikimedia.org/r/477587
[20:47:56] <wikibugs>	 (03PS8) 10Paladox: httpd::mpm: Also remove mod_php for 7.0 and 7.2 if not prefork [puppet] - 10https://gerrit.wikimedia.org/r/477587
[20:48:30] <wikibugs>	 (03PS9) 10Paladox: httpd::mpm: Also remove mod_php for 7.0 and 7.2 if not prefork [puppet] - 10https://gerrit.wikimedia.org/r/477587 (https://phabricator.wikimedia.org/T208257)
[20:48:39] <wikibugs>	 (03PS10) 10Paladox: httpd::mpm: Also remove mod_php for 7.0 and 7.2 if not prefork [puppet] - 10https://gerrit.wikimedia.org/r/477587 (https://phabricator.wikimedia.org/T208257)
[20:48:47] <gtirloni>	 !log reimaging cloudvirt1019 with stretch T196507
[20:48:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:51] <stashbot>	 T196507: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507
[20:50:32] <icinga-wm>	 PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.3533 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[20:51:16] <XioNoX>	 !log remove 2 eqiad avoid path - T194542
[20:51:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:42] <wikibugs>	 (03PS5) 10Gehel: elasticsearch: configure LVS endpoint for new codfw clusters [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195)
[20:52:58] <icinga-wm>	 PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2751 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[20:54:07] <XioNoX>	 gehel: do we have to care about the above ^ ?
[20:54:38] <wikibugs>	 (03CR) 10Dzahn: [C: 031] httpd::mpm: Also remove mod_php for 7.0 and 7.2 if not prefork [puppet] - 10https://gerrit.wikimedia.org/r/477587 (https://phabricator.wikimedia.org/T208257) (owner: 10Paladox)
[20:54:48] <gehel>	 XioNoX: actually, probably yes (cc herron)
[20:55:05] <XioNoX>	 gehel: there seems to be a big uptake on kafka syslogs
[20:55:17] <herron>	 I think I know why
[20:55:20] <icinga-wm>	 PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.5613 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[20:55:35] <XioNoX>	 good :)
[20:55:39] <gehel>	 XioNoX: it is probably an indication that some service is spewing more logs than usual
[20:56:10] <wikibugs>	 (03PS5) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052
[20:56:11] <XioNoX>	 some service feeling alone and needing to talk?
[20:56:12] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "makes sense to me but since it's a global httpd module change that can influence a lot i just do +1 for now" [puppet] - 10https://gerrit.wikimedia.org/r/477587 (https://phabricator.wikimedia.org/T208257) (owner: 10Paladox)
[20:56:37] <gehel>	 kafka seems to log at NOTICE level, that's probably higher than we want (cc ottomata)
[20:56:44] <wikibugs>	 (03CR) 10DCausse: [C: 031] elasticsearch: configure LVS endpoint for new codfw clusters [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) (owner: 10Gehel)
[20:57:11] <gehel>	 herron: if you know, please tell! I'm curious!
[20:57:12] <herron>	 I’ll revert this last patch, rsyslog is complaining about the length of the lines and needs maxmessagesize bumped
[20:57:39] <herron>	 which afaik must be first in the config, so will have to think about how to accomplish that
[20:58:02] <wikibugs>	 (03PS1) 10Herron: Revert "logstash: ship kafka server logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/478060
[20:58:09] <logmsgbot>	 !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@cbe4551]: Install new Updater with INSERT DATA
[20:58:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:44] <wikibugs>	 (03CR) 10Herron: [C: 032] "reverting because rsyslog default max message size is not large enough to handle these, causing a flood of errors logged by rsyslogd" [puppet] - 10https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788) (owner: 10Herron)
[20:59:09] <wikibugs>	 (03CR) 10Herron: [C: 032] Revert "logstash: ship kafka server logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/478060 (owner: 10Herron)
[20:59:17] <wikibugs>	 (03PS2) 10Herron: Revert "logstash: ship kafka server logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/478060
[21:00:06] <gehel>	 herron: kafka still seems to log a lot, 500K messages in the last 15 minutes. I'm pinging people in -analytics
[21:00:22] <XioNoX>	 !log remove 2 esams avoid path + 4 prefered/selected transits - T194542
[21:00:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:21] <nuria>	 gehel: how do those messages look? because event validation errors are being logged, and 500K in 15 mins sounds very possible
[21:02:45] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] profile::phabricator::httpd: Update's worker config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox)
[21:02:49] <gehel>	 nuria: logstash now refuses to show me those message :(
[21:02:56] * gehel is probably doing something wrong
[21:03:44] <gehel>	 nuria: for example: `[2018-07-11 19:44:04,942] INFO [ReplicaFetcher replicaId=1001, leaderId=1003, fetcherId=2] Retrying leaderEpoch request for partition eqiad.change-prop.retry.cpjobqueue.retry.mediawiki.job.LocalPageMoveJob-0 as the leader reported an error: UNKNOWN_SERVER_ERROR (kafka.server.ReplicaFetcherThread)`
[21:03:50] <joal>	 nuria: eventlogging-error rate seems 4 or 5 per second  - A lot less than 500k for 15mins
[21:04:18] <wikibugs>	 (03PS1) 10CDanis: grafana-beta.wikimedia.org: add hiera for text varnishes [puppet] - 10https://gerrit.wikimedia.org/r/478062 (https://phabricator.wikimedia.org/T210416)
[21:04:23] <wikibugs>	 (03PS1) 10Bstorm: sonofgridengine: point to the actual executable for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/478063 (https://phabricator.wikimedia.org/T211258)
[21:04:26] <nuria>	 gehel: ok, that seems indeed like too verbose logging going there 
[21:04:54] <wikibugs>	 (03PS6) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052
[21:05:05] <gehel>	 also, it looks like there is a date in the message itself, and it does not match the log event date
[21:05:10] <dcausse>	 joal: it's something like [2018-11-29 15:19:18,236] INFO Deleted offset index \/srv\/kafka\/data\/webrequest_text-5\/00000000059726012114.index.deleted. (kafka.log.LogSegment)
[21:05:27] <wikibugs>	 (03CR) 10Paladox: profile::phabricator::httpd: Update's worker config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox)
[21:05:41] <dcausse>	 5k lines per second
[21:06:11] <joal>	 dcausse: Seems related to data deletion - 7 days of retention --> 2018-11-29
[21:06:14] <gehel>	 Oh, it looks like those messages go through syslog, so I assume they flow through systemd
[21:06:23] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] "Have to change the review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478034 (owner: 10Shreyasminocha)
[21:06:39] <gehel>	 so there is probably some lag and duplicated heard and wrong parsing
[21:06:54] <herron>	 gehel: these were coming from https://gerrit.wikimedia.org/r/476982 which is using imfile
[21:06:59] <wikibugs>	 (03PS2) 10Bstorm: sonofgridengine: point to the actual executable for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/478063 (https://phabricator.wikimedia.org/T211258)
[21:06:59] <nuria>	 gehel, dcausse: *I think* that also sounds like too-verbose logging. We do not use logstash to alarm on any kafka functionality at all. Let's disable logging to logstash completely if it is a nuisance
[21:07:08] <icinga-wm>	 RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[21:07:08] <wikibugs>	 (03PS7) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052
[21:07:14] <herron>	 there isn’t parsing on the message contents, but we can add it
[21:07:27] <logmsgbot>	 !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@cbe4551]: Install new Updater with INSERT DATA (duration: 09m 18s)
[21:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:43] <gehel>	 herron: I'm wondering if going through journald isn't more pain than it is worth
[21:08:00] <herron>	 in terms of extracting fields on those log types.  they are going rsyslog -> kafka -> logstash
[21:08:12] <herron>	 multi-line logs make it tricky
[21:08:33] <nuria>	 gehel: ottomata is not here but i think until tomorrow we can do w/o any logstash logging, we really do not use it at all
[21:08:34] <gehel>	 those messages are structured in whatever logging framework kafka uses, so serializing to text and running grok on that seems more error prone
[21:09:14] <gehel>	 herron: we should at least serialize that to json instead of text
[21:09:49] <gehel>	 herron / nuria: do you know how to disable the kafka logs for the time being?
[21:10:16] <nuria>	 gehel: looking, this is the ticket to enable them, give me a sec: https://phabricator.wikimedia.org/T205437
[21:10:18] <herron>	 they are, reverted the change a short while ago and it’s propagating out
[21:10:31] <gehel>	 herron: thanks!
[21:10:32] <dcausse>	 I no longer see those in logstash 
[21:10:32] <herron>	 they are disabled that is
[21:11:08] <wikibugs>	 (03CR) 1020after4: [C: 031] phabricator: Enable php-fpm on phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353) (owner: 10Paladox)
[21:11:10] <XioNoX>	 once the root cause is solved, shouldn't a new icinga/grafana check be added to catch the issue earlier instead of relying on UDP packet drops?
[21:11:26] <wikibugs>	 (03PS8) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052
[21:12:01] <herron>	 XioNoX: the rsyslog line length issue?
[21:12:02] <wikibugs>	 (03PS4) 10Paladox: phabricator: Enable php-fpm on phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353)
[21:12:07] <wikibugs>	 (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353) (owner: 10Paladox)
[21:12:27] <wikibugs>	 (03CR) 10BBlack: [C: 031] "Looks sane to a human!" [puppet] - 10https://gerrit.wikimedia.org/r/478062 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis)
[21:12:42] <nuria>	 herron: can you point me to the change you reverted?
[21:12:58] <XioNoX>	 no idea, I'm wondering if there is a way to catch similar issues in the future before they cause packet drops
[21:13:31] <herron>	 nuria: sure, it was https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476982/
[21:14:15] <wikibugs>	 (03PS1) 10CDanis: grafana-beta.wikimedia.org: add DNS entry for text varnishes [dns] - 10https://gerrit.wikimedia.org/r/478067 (https://phabricator.wikimedia.org/T210416)
[21:15:13] <wikibugs>	 (03PS2) 10CDanis: grafana-beta.wikimedia.org: add hiera for text varnishes [puppet] - 10https://gerrit.wikimedia.org/r/478062 (https://phabricator.wikimedia.org/T210416)
[21:15:43] <wikibugs>	 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10ayounsi)
[21:16:02] <wikibugs>	 (03CR) 10CDanis: [C: 032] grafana-beta.wikimedia.org: add hiera for text varnishes [puppet] - 10https://gerrit.wikimedia.org/r/478062 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis)
[21:16:12] <nuria>	 herron: thank you, have reopened our original ticket 
[21:19:21] <herron>	 np!
[21:19:30] <herron>	 XioNoX: in a nutshell this is the reason for migrating to the kafka logging pipeline
[21:19:49] <XioNoX>	 cool :)
[21:19:49] <herron>	 when complete we’ll be able to turn down udp
[21:20:13] <XioNoX>	 as long as you don't drop my precious packets :)
[21:20:16] <herron>	 and there will be much celebration
[21:20:17] <herron>	 hah
[21:20:28] <wikibugs>	 (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477960 (https://phabricator.wikimedia.org/T150618) (owner: 10Stella)
[21:20:56] <wikibugs>	 (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478055 (https://phabricator.wikimedia.org/T150618) (owner: 10Stella)
[21:21:18] <wikibugs>	 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Ottomata) Oh that is cool, thanks!    > > >
[21:21:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Updated InitialiseSettings.php for HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478055 (https://phabricator.wikimedia.org/T150618) (owner: 10Stella)
[21:23:32] <icinga-wm>	 RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[21:27:16] <wikibugs>	 (03CR) 10BBlack: [C: 031] grafana-beta.wikimedia.org: add DNS entry for text varnishes [dns] - 10https://gerrit.wikimedia.org/r/478067 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis)
[21:27:28] <wikibugs>	 (03PS3) 10Bstorm: sonofgridengine: point to the actual executable for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/478063 (https://phabricator.wikimedia.org/T211258)
[21:28:05] <wikibugs>	 (03CR) 10CDanis: [C: 032] grafana-beta.wikimedia.org: add DNS entry for text varnishes [dns] - 10https://gerrit.wikimedia.org/r/478067 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis)
[21:29:16] <wikibugs>	 (03CR) 10Bstorm: [C: 032] sonofgridengine: point to the actual executable for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/478063 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm)
[21:29:25] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2001.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:29:43] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2002.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:30:08] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10RobH) >>! In T211023#4804503, @ops-monitoring-bot wrote: > wmf-decommission-host was executed by robh for elastic2002.codfw.wmnet and performed the following actio...
[21:30:36] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10GTirloni) Stretch did not help, battery continues showing as recharging.  ` Smart Array P440ar in Slot 0 (Embedded)    Cache Serial Number: PDNLH0BRH8...
[21:30:42] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2003.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:30:56] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2004.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:31:07] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2005.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:31:22] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2006.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:31:41] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2007.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:32:36] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [changeprop/deploy@f675fcc]: Added performer to the revision-scores event
[21:32:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:43] <icinga-wm>	 RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[21:32:43] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2008.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:32:53] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2009.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:33:43] <wikibugs>	 (03PS1) 10GTirloni: cloudvirt1019: reimage with Jessie [puppet] - 10https://gerrit.wikimedia.org/r/478098 (https://phabricator.wikimedia.org/T196507)
[21:33:47] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2010.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:33:51] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@f675fcc]: Added performer to the revision-scores event (duration: 01m 15s)
[21:33:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:57] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2011.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:34:07] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2012.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:34:20] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2013.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:34:28] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2014.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:34:35] <wikibugs>	 (03CR) 10GTirloni: [C: 032] cloudvirt1019: reimage with Jessie [puppet] - 10https://gerrit.wikimedia.org/r/478098 (https://phabricator.wikimedia.org/T196507) (owner: 10GTirloni)
[21:34:37] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2015.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:34:45] <wikibugs>	 10Operations, 10ops-eqiad: eqiad: Re-connect cage cameras - https://phabricator.wikimedia.org/T207965 (10Cmjohnson) Some cameras have been re-connected as I am in their racks, others will need me to run new cables to reach the new switches.  Some progress as I get the chance.  Front of Cage Camera Rows A/B ->...
[21:34:47] <robh>	 so yeah, that script is going to spam this channel 24 times total.
[21:34:49] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2016.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:35:00] <robh>	 once per host.
[21:35:01] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2017.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:35:10] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2018.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:35:15] <paladox>	 robh and you're going to be pinged 24 times :P
[21:35:20] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2019.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:35:30] <wikibugs>	 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2020.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:35:39] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[21:35:40] <wikibugs>	 Operations, Discovery-Search (Current work), Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2021.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:35:51] <wikibugs>	 Operations, Discovery-Search (Current work), Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2022.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:35:56] <wikibugs>	 Operations, Research, Patch-For-Review, User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (Ottomata) @Banyek another Q:  Can we add permissions to the recommendationapi user on m2-master to be able to connect from stat1007?  This might not be...
[21:36:02] <wikibugs>	 Operations, Discovery-Search (Current work), Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2023.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:36:29] <wikibugs>	 Operations, Discovery-Search (Current work), Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2024.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo...
[21:37:55] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[21:38:27] <wikibugs>	 Operations, Research, Patch-For-Review, User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (Ottomata) For posterity, I did the following from neodymium for Baho:  ` python3 deploy.py import_languages 20181130 m2-master.eqiad.wmnet 3306 recomme...
[21:39:25] <gtirloni>	 !log reimaging cloudvirt1019 with jessie T196507
[21:39:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:29] <stashbot>	 T196507: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507
[21:40:27] <wikibugs>	 (PS1) CDanis: Temporarily override grafan1001's HTTP serving domain to grafana-beta.wikimedia.org.  Once we are happy with the migration we can re-point grafana.w.o Varnishes to it and simply remove this file. [puppet] - https://gerrit.wikimedia.org/r/478099 (https://phabricator.wikimedia.org/T210416)
[21:40:52] <wikibugs>	 (CR) Ottomata: [C: 2] EventLogging Logstash filter: move useful fields out of event [puppet] - https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: Gergő Tisza)
[21:40:59] <wikibugs>	 (PS4) Ottomata: EventLogging Logstash filter: move useful fields out of event [puppet] - https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: Gergő Tisza)
[21:41:02] <wikibugs>	 (CR) jerkins-bot: [V: -1] Temporarily override grafan1001's HTTP serving domain to grafana-beta.wikimedia.org.  Once we are happy with the migration we can re-point grafana.w.o Varnishes to it and simply remove this file. [puppet] - https://gerrit.wikimedia.org/r/478099 (https://phabricator.wikimedia.org/T210416) (owner: CDanis)
[21:41:15] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to Proton for pmiazga, bearND, Mholloway, MSantos, Tgr - https://phabricator.wikimedia.org/T211382 (pmiazga)
[21:41:27] <wikibugs>	 (CR) Ottomata: [V: 2 C: 2] EventLogging Logstash filter: move useful fields out of event [puppet] - https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: Gergő Tisza)
[21:41:52] <wikibugs>	 (PS2) CDanis: grafana1001: answer for grafana-beta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/478099 (https://phabricator.wikimedia.org/T210416)
[21:42:55] <wikibugs>	 (PS3) CDanis: grafana1001: answer for grafana-beta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/478099 (https://phabricator.wikimedia.org/T210416)
[21:43:22] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to Proton for pmiazga, bearND, Mholloway, MSantos, Tgr - https://phabricator.wikimedia.org/T211382 (pmiazga)
[21:43:25] <wikibugs>	 Operations, Discovery-Search (Current work), Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (RobH) asw-a-codfw:ge-5/0/8 elastic2001 asw-a-codfw:ge-5/0/20 elastic2002 asw-a-codfw:ge-5/0/21 elastic2003 asw-a-codfw:ge-8/0/3 elastic2004 asw-a-codfw:ge-8/0/4 el...
[21:43:30] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[21:44:27] <wikibugs>	 (CR) ArielGlenn: [C: 1] "Um a quick note about the commit message, it's the labstore boxes (dumps web/nfs servers), not the labsdbs that are involved." [puppet] - https://gerrit.wikimedia.org/r/478022 (https://phabricator.wikimedia.org/T211330) (owner: Elukey)
[21:44:39] <icinga-wm>	 RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy
[21:45:09] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[21:45:37] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to Proton for pmiazga, bearND, Mholloway, MSantos, Tgr - https://phabricator.wikimedia.org/T211382 (pmiazga) @bearND @Mholloway @MSantos @Tgr could you edit the task and put your shell usernames here please?  @mobrovac could you approve the request? Also...
[21:50:15] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[21:50:32] <wikibugs>	 Operations, Cloud-Services, Patch-For-Review: Setting up bulk proxies pointing to a multiwiki mediawiki-vagrant setup running on a labs vm - https://phabricator.wikimedia.org/T132216 (bd808) Open>Resolved a:bd808 Closing this out. 2.5 years with no updates so... yeah.
[21:50:36] <wikibugs>	 (CR) CDanis: [C: 2] grafana1001: answer for grafana-beta.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/478099 (https://phabricator.wikimedia.org/T210416) (owner: CDanis)
[21:53:14] <wikibugs>	 (CR) Ottomata: [V: 2 C: 2] "Wow, my claim about the raw event sometimes being JSON was wrong.  It shoudl be though!  Fixing: https://gerrit.wikimedia.org/r/#/c/eventl" [puppet] - https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: Gergő Tisza)
[21:56:33] <icinga-wm>	 RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 82.23 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[21:56:50] <wikibugs>	 (PS1) EBernhardson: Turn off wbsearchentities test [mediawiki-config] - https://gerrit.wikimedia.org/r/478103 (https://phabricator.wikimedia.org/T209402)
[22:05:00] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to Proton for pmiazga, bearND, Mholloway, MSantos, Tgr - https://phabricator.wikimedia.org/T211382 (pmiazga)
[22:09:49] <wikibugs>	 Operations, Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (RobH)
[22:11:02] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to Proton for pmiazga, bearND, Mholloway, MSantos, Tgr - https://phabricator.wikimedia.org/T211382 (Mholloway)
[22:12:09] <urandom>	 !log decommissioning cassandra-b, restbase2001 -- T210843
[22:12:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:13] <stashbot>	 T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843
[22:13:43] <wikibugs>	 (PS1) RobH: decom elastic2001-2024 [puppet] - https://gerrit.wikimedia.org/r/478105 (https://phabricator.wikimedia.org/T211023)
[22:14:30] <wikibugs>	 (CR) RobH: [C: 2] decom elastic2001-2024 [puppet] - https://gerrit.wikimedia.org/r/478105 (https://phabricator.wikimedia.org/T211023) (owner: RobH)
[22:15:15] <wikibugs>	 Operations, Discovery-Search (Current work), Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (RobH)
[22:17:10] <wikibugs>	 (PS1) RobH: decom elastic2001-2024 production dns entries [dns] - https://gerrit.wikimedia.org/r/478106 (https://phabricator.wikimedia.org/T211023)
[22:17:49] <wikibugs>	 (CR) RobH: [C: 2] decom elastic2001-2024 production dns entries [dns] - https://gerrit.wikimedia.org/r/478106 (https://phabricator.wikimedia.org/T211023) (owner: RobH)
[22:21:33] <wikibugs>	 Operations, Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (RobH)
[22:22:08] <wikibugs>	 Operations, ops-codfw, Core Platform Team, Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (Eevans)
[22:23:28] <wikibugs>	 Operations, ops-codfw, decommission, Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (RobH) a:RobH>Papaul Ok, these are now ready for SSD wipe.   Please note, since they are SSDs, a wipe (write zeros) won't work, and the hdparm ut...
[22:49:44] <wikibugs>	 Operations, Citoid, Patch-For-Review, Services (done), VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (cscott) If you could provide more details, I'd certainly be interested in helping debug the XPath library in...
[22:54:22] <wikibugs>	 Operations, ops-codfw, decommission, Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (RobH) p:Normal>High
[22:54:48] <wikibugs>	 Operations, ops-codfw, decommission, Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (RobH)
[22:55:05] <wikibugs>	 (PS2) Stella: Updated InitialiseSettings.php for HD logos [mediawiki-config] - https://gerrit.wikimedia.org/r/478055 (https://phabricator.wikimedia.org/T150618)
[22:58:45] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@be8f0c0]: Add 'morelike' recommendation public API specification T201192
[22:58:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:49] <stashbot>	 T201192: Build API to surface  'morelike' article recommendations for missing articles - https://phabricator.wikimedia.org/T201192
[23:06:19] <wikibugs>	 (PS1) Andrew Bogott: Openstack: monitor nova and kvm on cloudvirt hosts [puppet] - https://gerrit.wikimedia.org/r/478113 (https://phabricator.wikimedia.org/T211388)
[23:10:49] <wikibugs>	 (PS1) Dzahn: interface: use new data type Stdlib::Ip_address [puppet] - https://gerrit.wikimedia.org/r/478114
[23:11:43] <wikibugs>	 Operations, Traffic, Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (RobH) So, as of December 6th, there are no new emails to EquinixMaintenance.SG@ap.equinix.com since our last email/update request to Vivian.  Once...
[23:12:48] <wikibugs>	 (PS2) Andrew Bogott: Openstack: monitor nova and kvm on cloudvirt hosts [puppet] - https://gerrit.wikimedia.org/r/478113 (https://phabricator.wikimedia.org/T211388)
[23:12:50] <wikibugs>	 (PS1) Andrew Bogott: Disable alerting on cloudvirt1019 and 1020 [puppet] - https://gerrit.wikimedia.org/r/478115 (https://phabricator.wikimedia.org/T196507)
[23:15:24] <wikibugs>	 (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/478055 (https://phabricator.wikimedia.org/T150618) (owner: Stella)
[23:16:40] <wikibugs>	 (CR) Andrew Bogott: [C: 2] Disable alerting on cloudvirt1019 and 1020 [puppet] - https://gerrit.wikimedia.org/r/478115 (https://phabricator.wikimedia.org/T196507) (owner: Andrew Bogott)
[23:17:07] <wikibugs>	 (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/478055 (https://phabricator.wikimedia.org/T150618) (owner: Stella)
[23:21:32] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@be8f0c0]: Add 'morelike' recommendation public API specification T201192 (duration: 22m 46s)
[23:21:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:36] <stashbot>	 T201192: Build API to surface  'morelike' article recommendations for missing articles - https://phabricator.wikimedia.org/T201192
[23:25:02] <wikibugs>	 Operations, Recommendation-API, Research, Core Platform Team Kanban (Done with CPT), and 2 others: Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (bmansurov)
[23:25:42] <wikibugs>	 (PS1) Dzahn: gerrit: add data types for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116
[23:26:24] <wikibugs>	 Operations, DBA, Research, Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (bmansurov)
[23:29:35] <wikibugs>	 (PS2) Dzahn: gerrit: add data types for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116
[23:30:00] <wikibugs>	 (CR) Paladox: gerrit: add data types for all parameters (4 comments) [puppet] - https://gerrit.wikimedia.org/r/478116 (owner: Dzahn)
[23:30:14] <wikibugs>	 Operations, Research, Patch-For-Review, User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (bmansurov)
[23:37:08] <wikibugs>	 (PS3) Dzahn: gerrit: add data types for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116
[23:37:10] <wikibugs>	 (CR) Dzahn: gerrit: add data types for all parameters (4 comments) [puppet] - https://gerrit.wikimedia.org/r/478116 (owner: Dzahn)
[23:38:02] <wikibugs>	 (CR) jerkins-bot: [V: -1] gerrit: add data types for all parameters [puppet] - https://gerrit.wikimedia.org/r/478116 (owner: Dzahn)
[23:38:13] <wikibugs>	 Operations, monitoring, Patch-For-Review, User-CDanis: Upgrade grafana to 5.x - https://phabricator.wikimedia.org/T210416 (CDanis)
[23:43:43] <wikibugs>	 (CR) Dzahn: [C: -1] "the syntax errors should be gone after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475261/  needs to wait a bit.. don't want to" [puppet] - https://gerrit.wikimedia.org/r/478116 (owner: Dzahn)
[23:45:45] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [recommendation-api/deploy@299b268]: Add 'morelike' article recommendations API T201192
[23:45:47] <wikibugs>	 Operations, Icinga, Patch-For-Review: move icinga contacts file to public repo - https://phabricator.wikimedia.org/T164238 (Dzahn) a:Dzahn>None
[23:45:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:48] <stashbot>	 T201192: Build API to surface  'morelike' article recommendations for missing articles - https://phabricator.wikimedia.org/T201192
[23:47:03] <wikibugs>	 (Abandoned) Dzahn: icinga: test creating individual contact secrets [puppet] - https://gerrit.wikimedia.org/r/391980 (https://phabricator.wikimedia.org/T164238) (owner: Dzahn)
[23:47:15] <XioNoX>	 !log troubleshoot bird bfd on dns2001/cr1-codfw
[23:47:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:51] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [recommendation-api/deploy@299b268]: Add 'morelike' article recommendations API T201192 (duration: 02m 06s)
[23:47:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:38] <wikibugs>	 (PS1) Dzahn: nagios_common: remove commented section about contacts test file [puppet] - https://gerrit.wikimedia.org/r/478118 (https://phabricator.wikimedia.org/T164238)
[23:48:55] <wikibugs>	 Operations, Icinga, monitoring: move icinga contacts file to public repo - https://phabricator.wikimedia.org/T164238 (Dzahn)
[23:49:34] <wikibugs>	 (CR) Dzahn: [C: 2] "just cleaning up what i added back in 2017 and is commented" [puppet] - https://gerrit.wikimedia.org/r/478118 (https://phabricator.wikimedia.org/T164238) (owner: Dzahn)
[23:50:51] <wikibugs>	 Operations, MediaWiki-Debug-Logger, Performance-Team: Set up request profiling for PHP 7 - https://phabricator.wikimedia.org/T206152 (tstarling) Please install tideways, but it should only be enabled in php.ini on the debug servers, since it will cause a performance degradation even without being use...
[23:53:15] <wikibugs>	 (PS2) Dzahn: ci::website: convert apache to httpd [puppet] - https://gerrit.wikimedia.org/r/453554
[23:58:06] <mutante>	 ste1la: welcome, new contributor