[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:06:20] (CR) CRusnov: "> Patch Set 4:" [software/netbox-reports] - https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: CRusnov)
[00:10:28] !log bootstrapping cassandra-c, restbase2018 -- T210843
[00:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:31] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843
[00:11:11] RECOVERY - Check systemd state on restbase2018 is OK: OK - running: The system is fully operational
[00:11:22] (PS1) Tim Starling: Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - https://gerrit.wikimedia.org/r/477939
[00:11:35] RECOVERY - cassandra-c service on restbase2018 is OK: OK - cassandra-c is active
[00:11:57] RECOVERY - cassandra-c SSL 10.192.48.126:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-c valid until 2020-11-29 09:26:22 +0000 (expires in 724 days)
[00:12:27] (CR) jerkins-bot: [V: -1] Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[00:12:31] Operations, ops-codfw, Core Platform Team, Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (Eevans)
[00:13:18] !log re-enabling puppet on phabricator, applying change that adds php-fpm support on stretch ..which doesnt affect phab1001 (prod) on jessie.. BUT re-adds tuning config from the past for mpm_prefork.conf (more SpareServers etc) that was not actually applied due to a bug
[00:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:21] (PS1) Bstorm: sonofgridengine: grant more control over shadowd using the env settings [puppet] - https://gerrit.wikimedia.org/r/477940 (https://phabricator.wikimedia.org/T211258)
[00:15:32] !log MPM prefork tweaks for high load systems are applied again (apparently they were not since a change in the past that resulted in 2 competing configs in mods-enabled and conf-enabled with the latter one being loaded last and containing the package defaults
[00:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:42] (CR) Bstorm: [C: 2] sonofgridengine: grant more control over shadowd using the env settings [puppet] - https://gerrit.wikimedia.org/r/477940 (https://phabricator.wikimedia.org/T211258) (owner: Bstorm)
[00:18:08] (CR) Tim Starling: "Is there a reason we are running a PHP 5.5 lint against this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[00:20:11] (CR) Tim Starling: [C: 2] "I'm thinking about ways to abstract env/services in the context of https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/477939" [mediawiki-config] - https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) (owner: Krinkle)
[00:20:25] (PS5) Paladox: profile::phabricator::httpd: Fix worker configs and also use hiera value [puppet] - https://gerrit.wikimedia.org/r/477925
[00:21:22] (Merged) jenkins-bot: errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) (owner: Krinkle)
[00:22:31] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (Dzahn) While merging and confirming https://gerrit.wiki...
[00:25:47] (PS6) Paladox: profile::phabricator::httpd: Fix worker configs and also use hiera value [puppet] - https://gerrit.wikimedia.org/r/477925
[00:25:53] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/477925 (owner: Paladox)
[00:27:47] (CR) jenkins-bot: errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) (owner: Krinkle)
[00:31:24] (CR) Tim Starling: [C: 2] "I want to use this right now to test https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/467239/ which I just merged." [mediawiki-config] - https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) (owner: BPirkle)
[00:32:30] (Merged) jenkins-bot: Create script to intentionally trigger fatal errors in MediaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) (owner: BPirkle)
[00:36:23] (PS6) CRusnov: Add an old hardware report [software/netbox-reports] - https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899)
[00:37:30] (CR) Volans: [C: -1] "Code and compiler both looks good. Just a couple of minor details inline." (5 comments) [puppet] - https://gerrit.wikimedia.org/r/477845 (https://phabricator.wikimedia.org/T205899) (owner: CRusnov)
[00:38:14] !log tstarling@deploy1001 Synchronized private/FatalErrorSettings.php: (no justification provided) (duration: 00m 46s)
[00:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:19] (PS7) CRusnov: Add an old hardware report [software/netbox-reports] - https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899)
[00:40:52] (CR) jenkins-bot: Create script to intentionally trigger fatal errors in MediaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/477450 (https://phabricator.wikimedia.org/T210567) (owner: BPirkle)
[00:40:56] !log tstarling@deploy1001 Synchronized private/FatalErrorSettings.php: (no justification provided) (duration: 00m 46s)
[00:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:15] !log tstarling@deploy1001 Synchronized w/fatal-error.php: (no justification provided) (duration: 00m 46s)
[00:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:01] (CR) Volans: [C: 1] "LGTM. I agree with Faidon, better run this at least once and then decide based on the results what we want to do." (1 comment) [software/netbox-reports] - https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: CRusnov)
[00:50:12] (PS1) Tim Starling: In fatal-error.php remove incorrect $wiki parameter [mediawiki-config] - https://gerrit.wikimedia.org/r/477944
[00:52:16] (CR) Tim Starling: [C: 2] In fatal-error.php remove incorrect $wiki parameter [mediawiki-config] - https://gerrit.wikimedia.org/r/477944 (owner: Tim Starling)
[00:52:34] Operations, ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (RobH) Please note the captive nuts for this arrived to my place today, and the brackets are on site. I can now start on this process of swapping the PDUS over. I'll go in...
[00:53:20] (Merged) jenkins-bot: In fatal-error.php remove incorrect $wiki parameter [mediawiki-config] - https://gerrit.wikimedia.org/r/477944 (owner: Tim Starling)
[00:55:07] !log tstarling@deploy1001 Synchronized w/fatal-error.php: (no justification provided) (duration: 00m 47s)
[00:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:47] Operations, ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (RobH)
[00:59:35] (PS1) Tim Starling: Constants must be scalar in HHVM [mediawiki-config] - https://gerrit.wikimedia.org/r/477945
[01:00:04] twentyafterfour: #bothumor I ♥ Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T0100).
[01:01:09] (CR) jenkins-bot: In fatal-error.php remove incorrect $wiki parameter [mediawiki-config] - https://gerrit.wikimedia.org/r/477944 (owner: Tim Starling)
[01:01:24] (CR) Tim Starling: [C: 2] Constants must be scalar in HHVM [mediawiki-config] - https://gerrit.wikimedia.org/r/477945 (owner: Tim Starling)
[01:02:27] (Merged) jenkins-bot: Constants must be scalar in HHVM [mediawiki-config] - https://gerrit.wikimedia.org/r/477945 (owner: Tim Starling)
[01:03:29] !log tstarling@deploy1001 Synchronized w/fatal-error.php: (no justification provided) (duration: 00m 46s)
[01:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:19] (CR) Tim Starling: [C: 2] "I tested this in production by inducing 10 fatal errors with /w/fatal-error.php and verifying that the count went up accordingly in Grafan" [mediawiki-config] - https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) (owner: Krinkle)
[01:21:48] (CR) jenkins-bot: Constants must be scalar in HHVM [mediawiki-config] - https://gerrit.wikimedia.org/r/477945 (owner: Tim Starling)
[02:02:25] RECOVERY - cassandra-c CQL 10.192.48.126:9042 on restbase2018 is OK: TCP OK - 0.036 second response time on 10.192.48.126 port 9042
[02:17:11] PROBLEM - Apache HTTP on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:17:21] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:17:23] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:17:23] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:25] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:25] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:29] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:17:31] PROBLEM - HTTP-dbtree on dbmonitor2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:17:31] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:31] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:33] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:37] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:17:37] PROBLEM - DPKG on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:17:43] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:43] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:45] PROBLEM - Check systemd state on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:17:45] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:49] PROBLEM - dhclient process on proton2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.75: Connection reset by peer
[02:17:51] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:17:51] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:17:51] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:55] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:17:57] PROBLEM - Check systemd state on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:17:59] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:01] PROBLEM - HHVM rendering on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:18:05] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[02:18:07] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:09] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:11] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:11] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:11] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:11] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:13] PROBLEM - PHP7 rendering on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:18:15] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:19] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:19] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:23] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:18:25] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[02:18:31] PROBLEM - Check whether ferm is active by checking the default input chain on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:18:31] PROBLEM - Disk space on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.139: Connection reset by peer
[02:18:33] PROBLEM - puppet last run on mw2264 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:18:55] PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.139: Connection reset by peer
[02:18:55] PROBLEM - Check size of conntrack table on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:19:05] PROBLEM - ganeti-mond running on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:19:25] PROBLEM - dhclient process on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:19:27] PROBLEM - Check systemd state on alsafi is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.50: Connection reset by peer
[02:19:27] PROBLEM - Check size of conntrack table on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:19:39] PROBLEM - mcrouter process on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:19:42] PROBLEM - Nginx local proxy to apache on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:19:42] PROBLEM - Disk space on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:19:45] PROBLEM - dhclient process on alsafi is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.50: Connection reset by peer
[02:19:53] PROBLEM - puppet last run on conf2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:19:59] PROBLEM - configured eth on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:20:01] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:20:09] PROBLEM - HHVM processes on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:20:13] PROBLEM - Check systemd state on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:20:17] PROBLEM - php7.2-fpm service on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:20:21] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.50: Connection reset by peer
[02:20:23] PROBLEM - DPKG on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.139: Connection reset by peer
[02:20:25] PROBLEM - puppet last run on es2018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:21:05] PROBLEM - Check systemd state on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:21:07] PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.139: Connection reset by peer
[02:21:23] PROBLEM - Check whether ferm is active by checking the default input chain on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:21:29] PROBLEM - Check size of conntrack table on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.139: Connection reset by peer
[02:21:31] PROBLEM - puppet last run on mw2261 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:21:43] PROBLEM - DPKG on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:21:47] PROBLEM - puppet last run on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:21:59] PROBLEM - Check systemd state on alsafi is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.50: Connection reset by peer
[02:22:09] PROBLEM - mcrouter process on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:22:13] PROBLEM - Disk space on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:22:25] PROBLEM - Check size of conntrack table on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:22:25] PROBLEM - dhclient process on dbmonitor2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.52: Connection reset by peer
[02:22:31] PROBLEM - configured eth on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:22:33] PROBLEM - ganeti-noded running on ganeti2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.127: Connection reset by peer
[02:22:35] PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.50: Connection reset by peer
[02:22:35] PROBLEM - DPKG on mwdebug2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.66: Connection reset by peer
[02:22:41] PROBLEM - configured eth on pybal-test2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.139: Connection reset by peer
[02:22:51] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.50: Connection reset by peer
[02:22:53] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:23:09] RECOVERY - Check systemd state on alsafi is OK: OK - running: The system is fully operational
[02:23:09] RECOVERY - dhclient process on ganeti2003 is OK: PROCS OK: 0 processes with command name dhclient
[02:23:11] RECOVERY - Check size of conntrack table on mwdebug2002 is OK: OK: nf_conntrack is 0 % full
[02:23:17] RECOVERY - Apache HTTP on mwdebug2002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.116 second response time
[02:23:17] RECOVERY - Nginx local proxy to apache on mwdebug2002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.204 second response time
[02:23:17] PROBLEM - Check systemd state on pybal-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:23:17] RECOVERY - mcrouter process on mwdebug2002 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter
[02:23:19] RECOVERY - Disk space on dbmonitor2001 is OK: DISK OK
[02:23:19] RECOVERY - Disk space on mwdebug2002 is OK: DISK OK
[02:23:23] RECOVERY - Check whether ferm is active by checking the default input chain on mwdebug2002 is OK: OK ferm input default policy is set
[02:23:23] RECOVERY - Disk space on pybal-test2001 is OK: DISK OK
[02:23:23] RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient
[02:23:29] RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational
[02:23:29] RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set
[02:23:29] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[02:23:31] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy
[02:23:31] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[02:23:35] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy
[02:23:35] RECOVERY - dhclient process on dbmonitor2001 is OK: PROCS OK: 0 processes with command name dhclient
[02:23:35] RECOVERY - Check size of conntrack table on ganeti2003 is OK: OK: nf_conntrack is 0 % full
[02:23:35] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[02:23:37] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[02:23:37] RECOVERY - HTTP-dbtree on dbmonitor2001 is OK: HTTP OK: HTTP/1.1 200 OK - 78891 bytes in 1.076 second response time
[02:23:39] RECOVERY - configured eth on mwdebug2002 is OK: OK - interfaces up
[02:23:41] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy
[02:23:41] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy
[02:23:41] RECOVERY - ganeti-noded running on ganeti2003 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded
[02:23:43] RECOVERY - Check whether ferm is active by checking the default input chain on dbmonitor2001 is OK: OK ferm input default policy is set
[02:23:43] RECOVERY - DPKG on alsafi is OK: All packages OK
[02:23:43] RECOVERY - DPKG on mwdebug2002 is OK: All packages OK
[02:23:44] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[02:23:45] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy
[02:23:47] RECOVERY - configured eth on pybal-test2001 is OK: OK - interfaces up
[02:23:49] RECOVERY - Check size of conntrack table on dbmonitor2001 is OK: OK: nf_conntrack is 0 % full
[02:23:49] RECOVERY - Check size of conntrack table on pybal-test2001 is OK: OK: nf_conntrack is 0 % full
[02:23:49] RECOVERY - HHVM processes on mwdebug2002 is OK: PROCS OK: 6 processes with command name hhvm
[02:23:49] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy
[02:23:53] RECOVERY - Check systemd state on ganeti2003 is OK: OK - running: The system is fully operational
[02:23:57] RECOVERY - dhclient process on proton2002 is OK: PROCS OK: 0 processes with command name dhclient
[02:23:57] RECOVERY - php7.2-fpm service on mwdebug2002 is OK: OK - php7.2-fpm is active
[02:23:59] RECOVERY - ganeti-mond running on ganeti2003 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond
[02:23:59] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[02:23:59] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[02:23:59] RECOVERY - configured eth on alsafi is OK: OK - interfaces up
[02:24:01] RECOVERY - DPKG on pybal-test2001 is OK: All packages OK
[02:24:03] RECOVERY - Check systemd state on dbmonitor2001 is OK: OK - running: The system is fully operational
[02:24:03] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy
[02:24:05] RECOVERY - HHVM rendering on mwdebug2002 is OK: HTTP OK: HTTP/1.1 200 OK - 76170 bytes in 0.304 second response time
[02:24:05] RECOVERY - DPKG on dbmonitor2001 is OK: All packages OK
[02:24:11] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy
[02:24:11] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[02:24:13] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy
[02:24:15] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy
[02:24:17] RECOVERY - PHP7 rendering on mwdebug2002 is OK: HTTP OK: HTTP/1.1 200 OK - 76210 bytes in 0.266 second response time
[02:24:19] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[02:24:19] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[02:24:19] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy
[02:24:23] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[02:24:23] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[02:24:29] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy
[02:24:31] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy
[02:24:35] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy
[02:25:05] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy
[02:25:09] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures
[02:25:09] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[02:25:15] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy
[02:26:53] RECOVERY - puppet last run on ganeti2003 is OK: OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures
[02:37:54] eh?
[02:39:04] the citoid/restbase ones were semi-expected, it's been flapping since switch to kubernetes, Marko&Alex are working on that
[02:39:19] RECOVERY - Check systemd state on pybal-test2001 is OK: OK - running: The system is fully operational
[02:40:10] the mwdebug ones I do not know about
[02:41:44] thanks Pchelolo
[02:46:03] RECOVERY - puppet last run on conf2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:46:33] RECOVERY - puppet last run on es2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:47:37] RECOVERY - puppet last run on mw2261 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:48:59] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[02:49:43] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:49:55] RECOVERY - puppet last run on mw2264 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:56:42] Operations, ops-codfw, Core Platform Team, Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (Eevans)
[02:56:53] Operations, ops-codfw, Core Platform Team, Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (Eevans)
[03:09:16] Operations, Community-Tech, MediaWiki-extensions-PageAssessments: cron spam from mwmaint1002 - https://phabricator.wikimedia.org/T211269 (Mathew.onipe)
[03:10:33] Operations, Community-Tech, DBA, MediaWiki-Database, MediaWiki-extensions-PageAssessments: cron spam from mwmaint1002 - https://phabricator.wikimedia.org/T211269 (Mathew.onipe)
[03:15:18] Operations, Commons, MediaWiki-File-management, MediaWiki-Maintenance-scripts, Multimedia: Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T211270 (Mathew.onipe)
[03:21:25] Operations, Beta-Cluster-Infrastructure, Cloud-Services: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211271 (Mathew.onipe)
[03:23:04] Operations, Beta-Cluster-Infrastructure, Cloud-Services: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211271 (Mathew.onipe) p:Triage>Normal
[03:25:40] Operations, MediaWiki-API: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211272 (Mathew.onipe)
[03:26:10] Operations, MediaWiki-API: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211272 (Mathew.onipe) p:Triage>Normal
[03:28:18] Operations, MediaWiki-Debug-Logger: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211273 (Mathew.onipe)
[03:28:24] Operations, MediaWiki-Debug-Logger: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211273 (Mathew.onipe) p:Triage>Normal
[03:37:17] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 886.59 seconds
[03:44:54] (PS2) Tim Starling: Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - https://gerrit.wikimedia.org/r/477939
[03:44:56] (PS1) Tim Starling: Class wrapper for ProductionServices.php etc. [mediawiki-config] - https://gerrit.wikimedia.org/r/477956
[03:44:58] (PS1) Tim Starling: Put profiler hostnames in ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477957
[03:45:57] (CR) jerkins-bot: [V: -1] Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[03:46:40] (CR) Mathew.onipe: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster (2 comments) [puppet] - https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: Mathew.onipe)
[03:47:00] (PS2) Mathew.onipe: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023)
[03:54:06] (PS1) Mathew.onipe: setup: change curator version to 4.2.5 to match our current elasticsearch version [software/spicerack] - https://gerrit.wikimedia.org/r/477958
[04:56:05] (PS1) Stella: Add several HD logos [mediawiki-config] - https://gerrit.wikimedia.org/r/477960 (https://phabricator.wikimedia.org/T150618)
[04:56:22] (CR) jerkins-bot: [V: -1] Add several HD logos [mediawiki-config] - https://gerrit.wikimedia.org/r/477960 (https://phabricator.wikimedia.org/T150618) (owner: Stella)
[04:58:44] (PS2) Stella: Add several HD logos [mediawiki-config] - https://gerrit.wikimedia.org/r/477960 (https://phabricator.wikimedia.org/T150618)
[05:19:45] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 187.30 seconds
[05:27:21] Operations, MediaWiki-API: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211272 (Legoktm) @Mathew.onipe any reason you added #mediawiki-api ? I don't see the relevance.
[05:31:54] 10Operations: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211272 (10Mathew.onipe) [05:32:26] 10Operations: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211272 (10Mathew.onipe) My bad. Removed. Thanks [05:47:58] (03PS5) 10Fomafix: Add language codes sr-cyrl and sr-latn next to sr-ec and sr-el [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375616 (https://phabricator.wikimedia.org/T117845) [06:12:48] (03CR) 10Legoktm: "> LGTM, not merging right now as I don't know the status of the related code in prod, but feel free to ping if you need it merged." [puppet] - 10https://gerrit.wikimedia.org/r/475579 (https://phabricator.wikimedia.org/T210312) (owner: 10Legoktm) [06:29:33] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:33:41] 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10User-ArielGlenn: Correctly collect logs from php-fpm pools - https://phabricator.wikimedia.org/T211184 (10Joe) [06:39:21] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [06:40:42] 10Operations, 10Traffic, 10HTTPS, 10Upstream: Enable ESNI support on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Liuxinyu970226) @Krenair @Bawolff @jcrespo Wondering if we can enable QUIC support on our server clusters instead? I've heard that the [[https://github.com/googlehosts|Googl... 
[06:58:08] (03CR) 10Elukey: "Going to delete these crons since they are not used anymore (see the ensure => absent) and they are confusing, feel free to drop this code" [puppet] - 10https://gerrit.wikimedia.org/r/477818 (owner: 10Fdans) [07:00:50] (03PS1) 10Elukey: profile::analytics::refinery::job::sqoop_mediawiki: remove crons [puppet] - 10https://gerrit.wikimedia.org/r/477962 [07:12:52] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211278 (10ops-monitoring-bot) [07:25:15] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211279 (10ops-monitoring-bot) [07:29:41] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::sqoop_mediawiki: remove crons [puppet] - 10https://gerrit.wikimedia.org/r/477962 (owner: 10Elukey) [07:34:12] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211281 (10ops-monitoring-bot) [07:35:08] ACKNOWLEDGEMENT - Device not healthy -SMART- on stat1004 is CRITICAL: cluster=analytics device=sde instance=stat1004:9100 job=node site=eqiad Muehlenhoff /dev/sde is a USB drive which is temporarily attached and which smartctl cant parse with its default settings. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004var-datasource=eqiad%2520prometheus%252Fops [07:38:33] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211282 (10ops-monitoring-bot) [07:40:04] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [07:41:43] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211283 (10ops-monitoring-bot) [07:42:33] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [07:45:34] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211284 (10ops-monitoring-bot) [07:48:10] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211285 (10ops-monitoring-bot) [07:53:19] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211286 (10ops-monitoring-bot) [07:55:52] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211287 (10ops-monitoring-bot) [07:58:21] (03PS2) 10Muehlenhoff: Remove Diamond on maps servers [puppet] - 10https://gerrit.wikimedia.org/r/477748 (https://phabricator.wikimedia.org/T183454) [08:03:28] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211288 (10ops-monitoring-bot) [08:03:51] PROBLEM - Apache HTTP on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:04:55] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.033 second response time [08:09:17] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211289 (10ops-monitoring-bot) [08:12:19] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211290 
(10ops-monitoring-bot) [08:17:22] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211291 (10ops-monitoring-bot) [08:18:50] PROBLEM - puppet last run on an-worker1083 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[spark2_yarn_shuffle_jar_install],Package[hadoop-client],Package[libhdfs0] [08:19:23] (03PS1) 10Elukey: profile::hadoop::spark2: explicitly require hadoop-yarn-nodemanager [puppet] - 10https://gerrit.wikimedia.org/r/477965 [08:20:33] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211292 (10ops-monitoring-bot) [08:21:36] (03CR) 10Elukey: [C: 032] profile::hadoop::spark2: explicitly require hadoop-yarn-nodemanager [puppet] - 10https://gerrit.wikimedia.org/r/477965 (owner: 10Elukey) [08:24:50] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211293 (10ops-monitoring-bot) [08:25:10] PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:26:01] this is my fault --^ [08:28:12] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:28:56] RECOVERY - puppet last run on an-worker1083 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:30:16] (03PS3) 10Muehlenhoff: Remove Diamond on maps servers [puppet] - 10https://gerrit.wikimedia.org/r/477748 (https://phabricator.wikimedia.org/T183454) [08:30:50] 10Operations, 10Traffic, 10HTTPS, 10Upstream: Enable ESNI support on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Bawolff) >>! In T205378#4802463, @Liuxinyu970226 wrote: > @Krenair @Bawolff @jcrespo Wondering if we can enable QUIC support on our server clusters instead? I've heard that... 
[08:31:35] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211295 (10ops-monitoring-bot) [08:31:57] (03CR) 10Muehlenhoff: [C: 032] Remove Diamond on maps servers [puppet] - 10https://gerrit.wikimedia.org/r/477748 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [08:36:32] (03PS1) 10Elukey: profile::hadoop::spark2: fix dependency on exec [puppet] - 10https://gerrit.wikimedia.org/r/477968 [08:37:40] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211296 (10ops-monitoring-bot) [08:39:03] (03CR) 10Elukey: [C: 032] profile::hadoop::spark2: fix dependency on exec [puppet] - 10https://gerrit.wikimedia.org/r/477968 (owner: 10Elukey) [08:43:44] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:45:50] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:48:56] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:49:54] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211298 (10ops-monitoring-bot) [08:54:34] (03CR) 10Urbanecm: [C: 04-1] "Due to SWAT policies, InitialiseSettings.php must be updated in another patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477960 (https://phabricator.wikimedia.org/T150618) (owner: 10Stella) [08:57:39] 10Operations, 10Icinga, 10Scoring-platform-team, 10Patch-For-Review: Add ahalfaker to ORES-related icinga contacts - https://phabricator.wikimedia.org/T210742 (10Halfak) Thank you! 
[08:59:41] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:00:08] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211299 (10ops-monitoring-bot) [09:03:20] (03CR) 10Muehlenhoff: "I think there's been some general confusion where the metrics are supposed to end up: So far labmon has been mostly used to collect metric" [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) (owner: 10Cwhite) [09:06:28] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211300 (10ops-monitoring-bot) [09:10:04] (03PS1) 10Elukey: profile::hadoop::common: remove explicit dep ordering [puppet] - 10https://gerrit.wikimedia.org/r/477970 [09:10:15] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211301 (10ops-monitoring-bot) [09:12:08] (03CR) 10Elukey: [C: 032] profile::hadoop::common: remove explicit dep ordering [puppet] - 10https://gerrit.wikimedia.org/r/477970 (owner: 10Elukey) [09:12:57] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211302 (10ops-monitoring-bot) [09:28:35] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211303 (10ops-monitoring-bot) [09:29:41] (03CR) 10Gehel: "LGTM, minor comment inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 (owner: 10Mathew.onipe) [09:32:50] (03CR) 10Gehel: [C: 04-1] "LGTM, let's wait until elasticsearch is shutdown on all those servers to merge it" [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe) [09:33:15] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211304 (10ops-monitoring-bot) [09:38:07] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - 
https://phabricator.wikimedia.org/T211307 (10ops-monitoring-bot) [09:39:38] (03PS1) 10Ema: ATS: define global Lua scripts in plugin.config [puppet] - 10https://gerrit.wikimedia.org/r/477974 (https://phabricator.wikimedia.org/T207048) [09:42:07] (03PS1) 10Elukey: profile::hadoop::common: add ordering for apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/477975 [09:44:15] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Gehel) [09:44:28] (03CR) 10Elukey: [C: 032] profile::hadoop::common: add ordering for apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/477975 (owner: 10Elukey) [09:45:52] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (10Gehel) All servers configured. @Papaul I'm not sure if you need to track anything else on this task, but from my side, it can be closed. 
[09:48:54] (03PS2) 10Ema: ATS: define global Lua scripts in plugin.config [puppet] - 10https://gerrit.wikimedia.org/r/477974 (https://phabricator.wikimedia.org/T207048) [09:52:44] (03PS2) 10Alexandros Kosiaris: Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754 [09:53:19] (03CR) 10jerkins-bot: [V: 04-1] Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754 (owner: 10Alexandros Kosiaris) [09:53:50] (03CR) 10Alexandros Kosiaris: Rake: Support ignoring upstream modules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477754 (owner: 10Alexandros Kosiaris) [09:56:02] (03PS3) 10Alexandros Kosiaris: Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754 [09:56:28] !log elastic@codfw cleanup: deleting wikidatawiki_content_1537469318 index (failed reindex probably) [09:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:34] (03CR) 10jerkins-bot: [V: 04-1] Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754 (owner: 10Alexandros Kosiaris) [09:57:45] (03PS3) 10Ema: ATS: define global Lua scripts in plugin.config [puppet] - 10https://gerrit.wikimedia.org/r/477974 (https://phabricator.wikimedia.org/T207048) [09:59:06] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211308 (10ops-monitoring-bot) [09:59:44] (03CR) 10Muehlenhoff: prometheus: add directory size collector (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) (owner: 10Cwhite) [10:01:34] (03CR) 10Ema: [C: 032] ATS: define global Lua scripts in plugin.config [puppet] - 10https://gerrit.wikimedia.org/r/477974 (https://phabricator.wikimedia.org/T207048) (owner: 10Ema) [10:02:46] (03PS4) 10Alexandros Kosiaris: Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754 [10:02:54] 10Operations, 
10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211309 (10ops-monitoring-bot) [10:03:05] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Mathew.onipe) [10:06:33] 10Operations, 10Security-Team, 10Wikimedia-Site-requests: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10Aklapper) I've seen a few people not understanding `[Report Only] Refused to connect to blah`, thinking it is an error. I can only point to http://bots.wmflabs.org/~... [10:10:14] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Mathew.onipe) [10:12:09] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Mathew.onipe) [10:14:01] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211310 (10ops-monitoring-bot) [10:14:58] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Mathew.onipe) [10:17:04] PROBLEM - puppet last run on an-worker1087 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 13 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hadoop-client],Package[libhdfs0] [10:19:18] (03PS1) 10Hoo man: Wikidata: Display Kartographer mapframes for geocoordinate statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477976 (https://phabricator.wikimedia.org/T184933) [10:19:39] (03PS3) 10Alexandros Kosiaris: upgrade puppet stdlib from 4.22.0 to 4.24.0 [puppet] - 10https://gerrit.wikimedia.org/r/475260 (owner: 10Dzahn) [10:22:17] (03CR) 10Alexandros Kosiaris: [C: 032] "jenkins +2ed https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475260/ with this change." 
[puppet] - 10https://gerrit.wikimedia.org/r/477754 (owner: 10Alexandros Kosiaris) [10:22:26] (03PS5) 10Alexandros Kosiaris: Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754 [10:22:37] (03PS2) 10Mathew.onipe: setup: change curator version to 4.2.5 to match our current elasticsearch version [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 [10:24:10] (03CR) 10Alexandros Kosiaris: "rebased on top of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/477754/ jenkins +2s. an extended PCC run also passed so I 'll be " [puppet] - 10https://gerrit.wikimedia.org/r/475260 (owner: 10Dzahn) [10:27:14] RECOVERY - puppet last run on an-worker1087 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:29:26] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211311 (10ops-monitoring-bot) [10:29:55] !log depooling db1096 for schema change - T85757 [10:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:59] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:30:18] (03CR) 10Banyek: [C: 032] mariadb: depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477589 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:31:22] (03Merged) 10jenkins-bot: mariadb: depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477589 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:31:50] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211313 (10ops-monitoring-bot) [10:33:57] (03CR) 10jenkins-bot: mariadb: depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477589 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:36:57] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: depool db1096:3315 (duration: 00m 49s) [10:37:01] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:02] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:38:00] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211315 (10ops-monitoring-bot) [10:41:38] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211316 (10ops-monitoring-bot) [10:43:15] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211317 (10ops-monitoring-bot) [10:48:21] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211318 (10ops-monitoring-bot) [10:50:07] (03PS4) 10Volans: extdist: Switch to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/475579 (https://phabricator.wikimedia.org/T210312) (owner: 10Legoktm) [10:51:00] (03CR) 10Volans: [C: 032] "Great, merging." [puppet] - 10https://gerrit.wikimedia.org/r/475579 (https://phabricator.wikimedia.org/T210312) (owner: 10Legoktm) [10:51:02] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211319 (10ops-monitoring-bot) [10:53:42] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211320 (10ops-monitoring-bot) [10:55:37] (03PS1) 10MarcoAurelio: Add NS_PROJECT localised name for tt.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477979 (https://phabricator.wikimedia.org/T211312) [10:56:10] (03PS1) 10Banyek: Revert "mariadb: depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477980 [10:56:16] !log repooling db1096 for schema change - T85757 [10:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:19] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:56:46] !log disable event handler on Icinga for ms-be2047 MD Raid and MegaRAID checks, it's spamming Phabricator - T209921 
[10:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:49] T209921: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 [10:56:50] godog: ^^^ [10:57:01] I'm cleaning the tasks on phab, sorry about that [10:57:11] (03CR) 10Banyek: [C: 032] Revert "mariadb: depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477980 (owner: 10Banyek) [10:58:13] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477980 (owner: 10Banyek) [10:58:21] (03PS2) 10MarcoAurelio: Add NS_PROJECT localised name for tt.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477979 (https://phabricator.wikimedia.org/T211312) [10:58:50] (03PS1) 10Elukey: profile::cdh::apt: add custom exec for apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/477981 [10:59:49] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1096:3315 (duration: 00m 47s) [10:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:57] (03CR) 10jenkins-bot: Revert "mariadb: depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477980 (owner: 10Banyek) [11:01:19] (03CR) 10Elukey: [C: 032] profile::cdh::apt: add custom exec for apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/477981 (owner: 10Elukey) [11:01:26] (03PS2) 10Elukey: profile::cdh::apt: add custom exec for apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/477981 [11:05:11] (03CR) 10GTirloni: prometheus: add directory size collector (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) (owner: 10Cwhite) [11:08:14] (03CR) 10Mathew.onipe: setup: change curator version to 4.2.5 to match our current elasticsearch version (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 (owner: 10Mathew.onipe) [11:09:55] (03PS3) 10Mathew.onipe: setup: 
change curator version to 4.2.5 to match our current elasticsearch version [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 [11:13:54] (03CR) 10Volans: setup: change curator version to 4.2.5 to match our current elasticsearch version (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 (owner: 10Mathew.onipe) [11:15:19] (03PS1) 10Muehlenhoff: Add kerberos puppet wrapper (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/477987 [11:16:14] (03CR) 10jerkins-bot: [V: 04-1] Add kerberos puppet wrapper (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/477987 (owner: 10Muehlenhoff) [11:19:26] (03CR) 10Muehlenhoff: prometheus: add directory size collector (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) (owner: 10Cwhite) [11:20:54] (03PS2) 10Muehlenhoff: Add kerberos puppet wrapper (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/477987 [11:24:10] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211320 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:24:17] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211319 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:24:24] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211318 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:24:30] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211317 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:24:37] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211316 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:25:08] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211315 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:25:14] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - 
https://phabricator.wikimedia.org/T211313 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:25:22] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211311 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:25:33] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211310 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:25:38] (03CR) 10Phuedx: EventLogging Logstash filter: move useful fields out of event (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [11:25:39] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211309 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:25:46] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211308 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:25:54] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211307 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:26:02] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211304 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:26:08] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211303 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:26:16] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211302 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:26:22] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211301 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:26:31] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211300 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:26:34] lol ? 
[11:26:38] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211299 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:26:44] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211298 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:26:51] that many duplicates ? [11:26:52] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211296 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:26:55] akosiaris_: mass-closing duplicates [11:26:57] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211293 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:26:59] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211295 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:01] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211292 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:04] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211290 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:06] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211291 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:08] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211289 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:10] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211288 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:12] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211283 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:15] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211286 (10mobrovac) 
05Open>03Invalid Duplicate of {T209921} [11:27:16] (03CR) 10Phuedx: [C: 031] "Including Gergo's review above too." [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [11:27:18] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211285 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:20] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211287 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:22] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211284 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:24] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211282 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:27] mobrovac: with a script I guess ? [11:27:28] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211281 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:30] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211279 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:34] 10Operations, 10ops-codfw: Degraded RAID on ms-be2047 - https://phabricator.wikimedia.org/T211278 (10mobrovac) 05Open>03Invalid Duplicate of {T209921} [11:27:38] akosiaris_: mass-edit functionality of phab [11:27:52] what ? TIL [11:28:17] akosiaris_: https://www.mediawiki.org/wiki/Phabricator/Help#Batch_edits [11:28:23] https://www.mediawiki.org/wiki/Phabricator/Help#Batch_edits [11:28:25] but you have to be in a specific group [11:28:28] yeah never used that before [11:28:33] I couldn't add myself even from admin [11:28:38] ahahaha [11:28:56] so wait, you can't add yourself, but mobrovac can actually do it ? 
[11:28:57] volans: that's called segregation of power [11:29:05] i'm in the group [11:29:13] I think it depends on acl*phabricator somehow [11:31:23] 10Operations, 10Security-Team, 10Wikimedia-Site-requests: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10chasemp) Good points @aklapper. I am not sure if this wording is ours or default. I am making a note to discuss with #security-team. One question, I have done som... [11:31:27] (03CR) 10Phuedx: [C: 031] "You might also consider adding a remove_field filter to strip out the not-very-helpful schema field (which is always "EventError")." [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [11:31:58] (03CR) 10Alexandros Kosiaris: [C: 031] "PCC says noop at https://puppet-compiler.wmflabs.org/compiler1002/13850/, so after the dependent change is merged today, we can probably m" [puppet] - 10https://gerrit.wikimedia.org/r/475261 (owner: 10Dzahn) [11:35:15] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Volans) @Papaul @fgiunchedi Today the RAID alarm was continuously flapping and created a ton of tasks (see above) that I asked mo.brovac to close as he had access to the batch edit interface in Phabricator.... [11:36:07] 10Operations, 10User-jijiki: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10Joe) I would add another requirement: - we want all mediawiki cronjobs to only run in the datacenter where mediawiki is active right now. 
[11:36:47] 10Operations, 10User-jijiki: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10jijiki) [11:36:56] (03PS1) 10Ema: ATS: define one single global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/477990 (https://phabricator.wikimedia.org/T207048) [11:37:09] 10Operations, 10User-jijiki: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10jijiki) >>! In T211250#4803144, @Joe wrote: > I would add another requirement: > > - we want all mediawiki cronjobs to only run in the datacenter where mediawiki is active right now. Added in descr... [11:37:49] (03PS3) 10Muehlenhoff: Add kerberos puppet wrapper (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/477987 [11:40:38] (03PS2) 10Ema: ATS: define one single global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/477990 (https://phabricator.wikimedia.org/T207048) [11:42:23] (03PS1) 10Ema: ATS: do not cache files bigger than 1GB [puppet] - 10https://gerrit.wikimedia.org/r/477991 (https://phabricator.wikimedia.org/T209021) [11:49:23] jouncebot: next [11:49:23] In 0 hour(s) and 10 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T1200) [11:52:07] (03Abandoned) 10Alexandros Kosiaris: prometheus: Add a service label for OTRS [puppet] - 10https://gerrit.wikimedia.org/r/385385 (owner: 10Alexandros Kosiaris) [11:58:36] 10Operations, 10Security-Team, 10Wikimedia-Site-requests: Enable csp-report-only mode everywhere - https://phabricator.wikimedia.org/T207900 (10Aklapper) > Where are people seeing this? 
Console of the web browser's Developer Tools [11:59:54] !log installing nginx updates on mw canaries [11:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T1200). [12:00:04] Hauskatze: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:13] o/ [12:05:32] anyone for SWAT? [12:06:27] (03PS3) 10Ema: ATS: define one single global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/477990 (https://phabricator.wikimedia.org/T207048) [12:07:27] (03CR) 10Ema: [C: 032] ATS: define one single global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/477990 (https://phabricator.wikimedia.org/T207048) (owner: 10Ema) [12:14:09] Lucas_WMDE: hi [12:14:45] (03PS2) 10Ema: ATS: do not cache files bigger than 1GB [puppet] - 10https://gerrit.wikimedia.org/r/477991 (https://phabricator.wikimedia.org/T209021) [12:16:32] (03CR) 10Ema: [C: 032] ATS: do not cache files bigger than 1GB [puppet] - 10https://gerrit.wikimedia.org/r/477991 (https://phabricator.wikimedia.org/T209021) (owner: 10Ema) [12:21:30] No one available for deploying a little patch? 
[12:21:58] 10Operations, 10Traffic, 10Patch-For-Review: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (10ema) [12:22:48] Hauskatze: not me, sorry, I’m in a meeting [12:23:01] (addshore not available either for the same reason) [12:23:04] Lucas_WMDE: thanks for replying, at least [12:23:46] well, I'll keep waiting :) [12:23:57] good luck :) [12:24:08] apparently this is a bad week for SWAT, due to some releng offsite :/ [12:24:35] yup, but Wikitech says SWATs are okay [12:24:52] no MW train though [12:28:28] 10Operations: SRE quarterly goal: Ability to serve a fraction of the production traffic from PHP7 - https://phabricator.wikimedia.org/T206336 (10Joe) [12:28:34] 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (10Joe) 05Open>03Resolved [12:29:41] (03CR) 10Effie Mouzeli: [C: 031] "With '-exact', rebuilding python-thumbor-wikimedia for Debian stretch was successful." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/477796 (https://phabricator.wikimedia.org/T170817) (owner: 10Gilles) [12:31:43] 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10User-ArielGlenn, 10User-Joe: Correctly collect logs from php-fpm pools - https://phabricator.wikimedia.org/T211184 (10Joe) [12:38:06] 10Operations, 10Community-Tech, 10DBA, 10MediaWiki-Database, 10MediaWiki-extensions-PageAssessments: cron spam from mwmaint1002 - https://phabricator.wikimedia.org/T211269 (10Banyek) I'll take a look into this [12:40:21] 10Operations, 10Traffic, 10Patch-For-Review: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema) [12:40:24] 10Operations, 10Traffic, 10Patch-For-Review: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (10ema) 05Open>03Resolved All the functionalities currently provided by our varnish backends in terms of request/response mangling have been implemented, with two exceptions: 1. ATS... 
[12:42:56] (03PS1) 10Giuseppe Lavagetto: profile::hadoop::common: explicitly contain class cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/477995 [12:45:20] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/13852/ at the compiler level it's a noop but it could break puppet on all servers inclidi" [puppet] - 10https://gerrit.wikimedia.org/r/477995 (owner: 10Giuseppe Lavagetto) [12:47:29] (03CR) 10Elukey: [C: 032] profile::hadoop::common: explicitly contain class cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/477995 (owner: 10Giuseppe Lavagetto) [12:53:15] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 44 probes of 335 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [12:55:43] !log installing nginx updates on mw in codfw [12:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:09] (03PS1) 10Giuseppe Lavagetto: puppet-merge: avoid warning in numeric equality [puppet] - 10https://gerrit.wikimedia.org/r/477996 [12:56:36] !log depooling and shutting down elasticsearch on elastic2001-2024 - T211023 [12:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:40] T211023: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 [12:58:11] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 13 probes of 335 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T1300) [13:00:57] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - search-https_9243: Servers elastic2031.codfw.wmnet, elastic2042.codfw.wmnet, elastic2041.codfw.wmnet, 
elastic2033.codfw.wmnet, elastic2029.codfw.wmnet, elastic2053.codfw.wmnet, elastic2027.codfw.wmnet, elastic2026.codfw.wmnet, elastic2038.codfw.wmnet, elastic2039.codfw.wmnet, elastic2054.codfw.wmnet, elastic2035.codfw.wmnet, elastic2037.codf [13:00:57] 025.codfw.wmnet, elastic2051.codfw.wmnet, elastic2044.codfw.wmnet, elastic2040.codfw.wmnet, elastic2045.codfw.wmnet, elastic2043.codfw.wmnet, elastic2034.codfw.wmnet, elastic2036.codfw.wmnet, elastic2049.codfw.wmnet, elastic2032.codfw.wmnet, elastic2028.codfw.wmnet, elastic2030.codfw.wmnet, elastic2046.codfw.wmnet, elastic2047.codfw.wmnet are marked down but pooled: search_9200: Servers elastic2031.codfw.wmnet, elastic2042.codfw. [13:00:57] 1.codfw.wmnet, elastic2033.codfw.wmnet, elastic2029.codfw.wmnet, elastic2053.codfw.wmnet, elastic2026.codfw.wmnet, elastic2038.codfw.wmnet, elastic2050.codfw.wmnet, elastic2048.codfw.wm [13:01:05] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([elastic2031.codfw.wmnet, elastic2042.codfw.wmnet, elastic2027.codfw.wmnet, elastic2033.codfw.wmnet, elastic2029.codfw.wmnet, elastic2053.codfw.wmnet, elastic2040.codfw.wmnet, elastic2038.codfw.wmnet, elastic2050.codfw.wmnet, elastic2045.codfw.wmnet, elastic2039.codfw.wmnet, elastic2048.codfw.wmnet, elastic2054.codfw.wmnet, e [13:01:05] wmnet, elastic2037.codfw.wmnet, elastic2043.codfw.wmnet, elastic2051.codfw.wmnet, elastic2052.codfw.wmnet, elastic2044.codfw.wmnet, elastic2026.codfw.wmnet, elastic2041.codfw.wmnet, elastic2025.codfw.wmnet, elastic2034.codfw.wmnet, elastic2036.codfw.wmnet, elastic2049.codfw.wmnet, elastic2032.codfw.wmnet, elastic2028.codfw.wmnet, elastic2030.codfw.wmnet, elastic2046.codfw.wmnet, elastic2047.codfw.wmnet]) [13:01:15] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - search-https_9243: Servers elastic2031.codfw.wmnet, elastic2042.codfw.wmnet, elastic2041.codfw.wmnet, elastic2033.codfw.wmnet, 
elastic2029.codfw.wmnet, elastic2053.codfw.wmnet, elastic2027.codfw.wmnet, elastic2026.codfw.wmnet, elastic2050.codfw.wmnet, elastic2048.codfw.wmnet, elastic2054.codfw.wmnet, elastic2035.codfw.wmnet, elastic2037.codf [13:01:15] 025.codfw.wmnet, elastic2051.codfw.wmnet, elastic2044.codfw.wmnet, elastic2040.codfw.wmnet, elastic2045.codfw.wmnet, elastic2043.codfw.wmnet, elastic2034.codfw.wmnet, elastic2036.codfw.wmnet, elastic2049.codfw.wmnet, elastic2032.codfw.wmnet, elastic2028.codfw.wmnet, elastic2030.codfw.wmnet, elastic2046.codfw.wmnet, elastic2047.codfw.wmnet are marked down but pooled: search_9200: Servers elastic2031.codfw.wmnet, elastic2042.codfw. [13:01:15] 1.codfw.wmnet, elastic2033.codfw.wmnet, elastic2029.codfw.wmnet, elastic2053.codfw.wmnet, elastic2027.codfw.wmnet, elastic2026.codfw.wmnet, elastic2038.codfw.wmnet, elastic2050.codfw.wm [13:01:25] ^ that's me, should be back in a second (and yes, it is a real issue) [13:02:05] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [13:02:15] \ [13:02:20] (sorry typo) [13:02:23] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [13:06:19] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal [13:07:51] (03PS1) 10Gehel: elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) [13:11:11] PROBLEM - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 3443 threshold =0.15 breach: status: yellow, number_of_nodes: 53, unassigned_shards: 3346, number_of_pending_tasks: 3147, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3169, task_max_waiting_in_queue_millis: 586849, cluster_name: production-search-codfw, relocating_shards: 0, act [13:11:11] t_as_number: 63.7235275524, active_shards: 6048, initializing_shards: 97, 
number_of_data_nodes: 53, delayed_unassigned_shards: 12 [13:11:25] <_joe_> gehel: gat us gaooebubg> [13:11:34] <_joe_> err off by one [13:11:36] ? [13:11:37] (03CR) 10Mathew.onipe: [C: 031] elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel) [13:11:38] <_joe_> what is happening [13:11:42] the entire sentence ? [13:11:43] lol [13:11:44] <_joe_> gehel: onimisionipe [13:12:14] <_joe_> pybal is ok but the elasticsearch error? [13:12:24] (03CR) 10DCausse: [C: 04-1] elasticsearch: move master eligible nodes to new servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel) [13:12:37] _joe_ gehel pointed out already it's him [13:12:45] _joe_: gehel is on it... [13:12:45] (03PS1) 10Elukey: profile::hadoop::worker: move prometheus require before cdh class [puppet] - 10https://gerrit.wikimedia.org/r/477998 [13:12:46] <_joe_> akosiaris_: yeah, the pybal errors [13:12:57] <_joe_> but the health check issue is not expected [13:13:02] <_joe_> AFAICT [13:13:05] <_joe_> ok cool [13:14:31] for some definition of "expected" [13:14:32] (03CR) 10Elukey: [C: 032] profile::hadoop::worker: move prometheus require before cdh class [puppet] - 10https://gerrit.wikimedia.org/r/477998 (owner: 10Elukey) [13:14:52] I messed up the step order in the decom of old elastic servers [13:15:12] cluster is recovering already, but some indices are still in error [13:15:18] need help ? 
[13:15:24] not a huge deal, this is codfw, no user traffic [13:15:29] ok [13:15:35] nope, all is good, just need to wait at this point [13:16:14] * gehel is going to get the 9 cat tails and flog himself as penitence [13:17:22] ACKNOWLEDGEMENT - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 2417 threshold =0.15 breach: status: yellow, number_of_nodes: 54, unassigned_shards: 2334, number_of_pending_tasks: 386, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3169, task_max_waiting_in_queue_millis: 366680, cluster_name: production-search-codfw, relocating_shards: [13:17:23] _percent_as_number: 74.5337688336, active_shards: 7074, initializing_shards: 83, number_of_data_nodes: 54, delayed_unassigned_shards: 0 Gehel cluster recovering, should be green in a few minutes [13:20:40] 10Operations, 10Community-Tech, 10DBA, 10MediaWiki-Database, 10MediaWiki-extensions-PageAssessments: cron spam from mwmaint1002 - https://phabricator.wikimedia.org/T211269 (10Banyek) [13:22:40] there is a lot of elasticsearch cronspam still coming in, is that expected? [13:22:46] gehel: [13:22:55] apergos: yes, see backlog [13:23:11] apergos: something with hotthreads? [13:23:20] oh, yeah forgot about those as well [13:23:39] expected, not an issue except for the spam itself [13:23:41] on it [13:23:59] thank you! [13:25:47] 10Operations, 10Datasets-General-or-Unknown, 10WMDE-Analytics-Engineering, 10Wikidata: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739 (10Addshore) For reference for any future people looking at this, this is currently used by: - https://github.com/wikimedia/puppet/blob/... 
[13:26:36] RECOVERY - ElasticSearch health check for shards on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: status: yellow, number_of_nodes: 54, unassigned_shards: 1352, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3169, task_max_waiting_in_queue_millis: 924614, cluster_name: production-search-codfw, relocating_shards: 0, active_shards_percent_as_numbe [13:26:36] active_shards: 8084, initializing_shards: 55, number_of_data_nodes: 54, delayed_unassigned_shards: 0 [13:27:25] (03PS1) 10Gehel: elasticsearch: ignore output of hot thread cron [puppet] - 10https://gerrit.wikimedia.org/r/477999 [13:31:02] (03PS2) 10Gehel: elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) [13:32:33] (03CR) 10Gehel: [C: 032] elasticsearch: ignore output of hot thread cron [puppet] - 10https://gerrit.wikimedia.org/r/477999 (owner: 10Gehel) [13:33:00] 10Operations, 10Community-Tech, 10DBA, 10MediaWiki-Database, 10MediaWiki-extensions-PageAssessments: cron spam from mwmaint1002 - https://phabricator.wikimedia.org/T211269 (10Banyek) I checked the query `SELECT /* Wikimedia\Rdbms\Database::select www-data@mwmain... */ DISTINCT( pa_project_id ) FROM `pa... 
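The shard-check alerts in this stretch of the log (13:11 CRITICAL, 13:17 ACKNOWLEDGEMENT, 13:26 RECOVERY) can be reproduced from the fields they print. A minimal sketch, assuming the check counts unassigned plus initializing shards as "inactive" and compares their share of all shards against the 0.15 threshold; the function name `shard_check` and the exact comparison operator are illustrative assumptions, not the actual Icinga plugin.

```python
# Hypothetical re-computation of the shard health alert from the logged fields.
# Assumption: inactive = unassigned + initializing, and the alert fires when
# inactive / total reaches the threshold (0.15 in the log).

def shard_check(active, unassigned, initializing, threshold=0.15):
    """Return (inactive_count, active_percent, state) for one health sample."""
    inactive = unassigned + initializing
    total = active + inactive
    active_percent = 100.0 * active / total
    state = "CRITICAL" if inactive / total >= threshold else "OK"
    return inactive, active_percent, state


# (active_shards, unassigned_shards, initializing_shards) as printed in the log.
samples = {
    "13:11": (6048, 3346, 97),
    "13:17": (7074, 2334, 83),
    "13:26": (8084, 1352, 55),
}

for timestamp, fields in samples.items():
    inactive, active_percent, state = shard_check(*fields)
    print(timestamp, inactive, round(active_percent, 4), state)
```

With these inputs the inactive counts come out as 3443 and 2417 for the first two samples, matching the "inactive shards" figures in the alerts, and the 13:26 sample drops just under the threshold even though the cluster status is still yellow.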
[13:36:45] on the move, bbl [13:37:38] (03CR) 10DCausse: [C: 04-1] elasticsearch: move master eligible nodes to new servers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel) [13:38:09] (03PS1) 10Ema: ATS: support Collapsed Forwarding [puppet] - 10https://gerrit.wikimedia.org/r/478003 (https://phabricator.wikimedia.org/T207048) [13:41:37] (03CR) 10DCausse: [C: 04-1] elasticsearch: move master eligible nodes to new servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel) [13:43:43] (03PS3) 10Gehel: elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) [13:46:50] !log upgrading spamassassin on mx2001 [13:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:46] (03CR) 10DCausse: [C: 031] elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel) [13:49:02] (03PS4) 10Gehel: elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) [13:50:00] (03CR) 10Gehel: [C: 032] elasticsearch: move master eligible nodes to new servers [puppet] - 10https://gerrit.wikimedia.org/r/477997 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel) [13:54:29] 10Operations, 10Community-Tech, 10DBA, 10MediaWiki-Database, 10MediaWiki-extensions-PageAssessments: cron spam from mwmaint1002 - https://phabricator.wikimedia.org/T211269 (10Banyek) 05Open>03Invalid This is a duplicate of T208231 [13:55:01] 10Operations, 10DBA: Issues with purgeUnusedProjects.php cron job on mwmaint1002 (Fri Oct 26) - https://phabricator.wikimedia.org/T208231 (10Banyek) [13:59:40] PROBLEM - Check systemd state on elastic1049 is 
CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:00:19] gehel: elasticsearch_5@production-search-omega-codfw.service failed [14:00:30] yep, looking [14:00:56] * volans a bit surprised to have a codfw service in eqiad ;) [14:01:13] it shouldn't be there [14:01:17] obviously [14:01:30] templated systemd unit [14:01:37] got it [14:01:43] * gehel is looking through his history to see what he did wrong [14:03:06] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational [14:13:18] (03PS1) 10Muehlenhoff: Remove Diamond from DNS roles [puppet] - 10https://gerrit.wikimedia.org/r/478016 (https://phabricator.wikimedia.org/T183454) [14:16:51] !log installing nginx security updates on mw in eqiad [14:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:52] 10Operations, 10DBA: Issues with purgeUnusedProjects.php cron job on mwmaint1002 (Fri Oct 26) - https://phabricator.wikimedia.org/T208231 (10Banyek) [14:22:08] 10Operations, 10DBA, 10User-Banyek: Issues with purgeUnusedProjects.php cron job on mwmaint1002 (Fri Oct 26) - https://phabricator.wikimedia.org/T208231 (10Banyek) [14:25:38] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, 10Multimedia: Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T211270 (10jijiki) 05Open>03Invalid Duplicate of T150375 [14:26:33] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, 10Multimedia: Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T211270 (10Zoranzoki21) 05Invalid>03Open [14:26:49] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, 10Multimedia: Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T211270 
(10Zoranzoki21) [14:26:57] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, and 3 others: cronspam cleanup: Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T150375 (10Zoranzoki21) [14:28:54] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, and 3 others: cronspam cleanup: Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T150375 (10jijiki) @Aklapper We have resolved it on IR... [14:36:32] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-merge: avoid warning in numeric equality [puppet] - 10https://gerrit.wikimedia.org/r/477996 (owner: 10Giuseppe Lavagetto) [14:36:44] (03PS2) 10Giuseppe Lavagetto: puppet-merge: avoid warning in numeric equality [puppet] - 10https://gerrit.wikimedia.org/r/477996 [14:37:14] 10Operations, 10ops-eqiad, 10Analytics, 10cloud-services-team: Degraded RAID on cloudvirtan1001 - https://phabricator.wikimedia.org/T211235 (10jijiki) p:05Triage>03High [14:37:50] 10Operations, 10ops-eqiad, 10Analytics, 10cloud-services-team: Degraded RAID on cloudvirtan1001 - https://phabricator.wikimedia.org/T211235 (10jijiki) The alert from icinga is gone, close this if you believe everything is ok :) [14:46:39] !log uploaded nodejs 10.4.0~dfsg-1+wmf2 to apt.wikimedia.org/component/node10 (backports of recent security fixes) [14:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:53] moritzm: good if I move turnilo to node 10? 
[14:51:34] (03PS1) 10Elukey: Move remaining stat1005 references to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/478020 (https://phabricator.wikimedia.org/T205846) [14:51:49] 10Operations, 10Wikimedia-Logstash: Procure and provision Logging pipeline hardware in multiple datacenters - https://phabricator.wikimedia.org/T205850 (10herron) [14:51:52] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install codfw logstash elasticsearch storage servers - https://phabricator.wikimedia.org/T211065 (10herron) [14:53:13] elukey: let's go ahead! [14:56:36] (03PS6) 10Alexandros Kosiaris: Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754 [14:56:39] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Rake: Support ignoring upstream modules [puppet] - 10https://gerrit.wikimedia.org/r/477754 (owner: 10Alexandros Kosiaris) [14:59:36] !log elukey@deploy1001 Started deploy [analytics/turnilo/deploy@6bd6e2f]: upgrade deps to nodejs 10 [14:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:44] !log elukey@deploy1001 Finished deploy [analytics/turnilo/deploy@6bd6e2f]: upgrade deps to nodejs 10 (duration: 00m 09s) [14:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:05] moritzm: done! 
[15:01:00] elukey: congratulations for being the first non-k8s service to migrate :-) [15:01:08] (03PS4) 10Alexandros Kosiaris: upgrade puppet stdlib from 4.22.0 to 4.24.0 [puppet] - 10https://gerrit.wikimedia.org/r/475260 (owner: 10Dzahn) [15:01:14] \o/ [15:01:15] (03CR) 10Alexandros Kosiaris: [C: 032] upgrade puppet stdlib from 4.22.0 to 4.24.0 [puppet] - 10https://gerrit.wikimedia.org/r/475260 (owner: 10Dzahn) [15:01:29] (03PS2) 10Muehlenhoff: Install Imagemagick policy files for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/476818 [15:04:27] (03CR) 10Muehlenhoff: [C: 032] Install Imagemagick policy files for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/476818 (owner: 10Muehlenhoff) [15:05:11] !log upgrade nginx on wdqs servers [15:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:22] (03PS1) 10Giuseppe Lavagetto: Hotfix for logging in php-fpm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478021 (https://phabricator.wikimedia.org/T211184) [15:08:19] (03CR) 10jerkins-bot: [V: 04-1] Hotfix for logging in php-fpm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478021 (https://phabricator.wikimedia.org/T211184) (owner: 10Giuseppe Lavagetto) [15:08:50] !log restarting new elasticsearch masters on codfw - T211023 [15:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:53] T211023: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 [15:09:02] (03PS5) 10Alexandros Kosiaris: upgrade puppet stdlib from 4.22.0 to 4.24.0 [puppet] - 10https://gerrit.wikimedia.org/r/475260 (owner: 10Dzahn) [15:09:04] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] upgrade puppet stdlib from 4.22.0 to 4.24.0 [puppet] - 10https://gerrit.wikimedia.org/r/475260 (owner: 10Dzahn) [15:09:48] PROBLEM - puppet last run on analytics1066 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:10:10] (03PS2) 10Alexandros Kosiaris: upgrade puppet stdlib from 4.24.0 to 4.25.1 [puppet] - 10https://gerrit.wikimedia.org/r/475261 (owner: 10Dzahn) [15:12:07] Dec 6 15:04:43 analytics1066 puppet-agent[15161]: Could not retrieve catalog from remote server: Error 503 on SERVER: [15:12:44] maybe is it a temp glitch due to the stdlib migration? [15:14:57] ah yes a puppet run fixes it [15:15:00] RECOVERY - puppet last run on analytics1066 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:26:50] it would be [15:26:54] for 503 with HTML ? [15:27:00] it could* be [15:27:26] no idea, after a manual puppet run the new stdlib was deployed and everything went back to normal [15:27:41] there is a race condition when merging changes for files that get added/removed [15:28:10] it's between the catalog actually being compiled and being applied [15:28:22] it might very well be a catalog with the old change [15:28:53] and the new revision might no longer have the files mentioned in the catalog (or have different versions) [15:29:13] (03PS1) 10Elukey: profile::statistics::private: allow labsdb to push nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/478022 (https://phabricator.wikimedia.org/T211330) [15:29:21] it coalesces pretty quickly [15:29:40] the entire window is probably something like 3-5 secs [15:32:48] (03PS2) 10Elukey: profile::statistics::private: allow labsdb to push nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/478022 (https://phabricator.wikimedia.org/T211330) [15:32:58] (03CR) 10ArielGlenn: [C: 031] "The labstore/dumps related pieces of this look fine to me, as long as all the rest of stat1005's functionality has also been moved over." 
[puppet] - 10https://gerrit.wikimedia.org/r/478020 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [15:38:40] !log uploaded nodejs 6.11~dfsg-1+wmf5 for stretch-wikimedia (the upstream patch for CVE-2018-12122 had a regression, this update fixes it) [15:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:02] <_joe_> akosiaris: are you done with stdlib? [15:39:25] (03PS2) 10Giuseppe Lavagetto: wmflib: make the role() function store a path in $::_role [puppet] - 10https://gerrit.wikimedia.org/r/475498 [15:39:26] <_joe_> I want to merge my hiera changes [15:39:33] <_joe_> at least the first two [15:40:59] _joe_: yes [15:41:10] the next version bump is for Monday [15:42:13] <_joe_> ack [15:42:30] <_joe_> !log disabling puppet fleet-wide for a change in the role() function [15:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:40] (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib: make the role() function store a path in $::_role [puppet] - 10https://gerrit.wikimedia.org/r/475498 (owner: 10Giuseppe Lavagetto) [15:42:51] !log modifying zotero deploy CLUSTER=codfw scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero - T211322 [15:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:55] T211322: zotero pods on codfw should use codfw url downloader - https://phabricator.wikimedia.org/T211322 [15:45:01] !log fsero@deploy1001 scap-helm zotero upgrade production -f ../zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [15:45:02] !log fsero@deploy1001 scap-helm zotero cluster codfw completed [15:45:02] !log fsero@deploy1001 scap-helm zotero finished [15:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:45:54] (03CR) 10ArielGlenn: [C: 031] "Seems ok to me, though if this temporary solution turns into a long term one, we don't really need to give everything in $statistics_serve" [puppet] - 10https://gerrit.wikimedia.org/r/478022 (https://phabricator.wikimedia.org/T211330) (owner: 10Elukey) [15:51:20] !log upgrading spamassassin on mx1001/fermium [15:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:18] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 141.7 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1panelId=2fullscreen [16:02:05] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Andy.Johnson@dell.com 9:40 AM (21 minutes ago) to me, faidon Dell Customer Communication Here is a link to the Dell Support Live Image (SLI) Version 3.0. with this we can test the hardware... [16:12:49] (03PS2) 10Herron: logstash: ship kafka server logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788) [16:17:27] !log shutting down elasticsearch on elastic2001-2024 (second try) - T211023 [16:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:31] T211023: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 [16:21:35] 10Operations, 10Phabricator: Switch PHP-FPM on phab1002 - https://phabricator.wikimedia.org/T211353 (10Paladox) [16:23:07] (03PS3) 10Gehel: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe) [16:27:17] !log decommissioning cassandra-a, restbase2001 -- T210843 [16:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:21] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 
[16:39:26] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Banyek) @Ottomata Yes, I don't see anything against this. Just make sure that the data is copied over a secure channel and get removed both the export... [16:44:49] (03CR) 10DCausse: [C: 04-1] elasticsearch: Remove elastic2001-elastic2024 from codfw cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe) [16:45:42] (03PS1) 10Andrew Bogott: no-op patch for tox testing purposes [software/cumin] - 10https://gerrit.wikimedia.org/r/478026 [16:56:33] (03PS1) 10Bmansurov: Labs: enable reader trust survey on enwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478028 (https://phabricator.wikimedia.org/T209882) [16:57:33] (03CR) 10jerkins-bot: [V: 04-1] Labs: enable reader trust survey on enwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478028 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [16:58:01] (03PS1) 10Gehel: elasticsearch: purge main elasticsearch configuration directory [puppet] - 10https://gerrit.wikimedia.org/r/478029 (https://phabricator.wikimedia.org/T211023) [17:00:04] godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:02:42] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Gehel) We need to reassign some nodes between the psi and omega cluster, as removing old nodes would leave the clusters unbalanced between rows. This will require... 
[17:05:01] 10Operations, 10DBA, 10User-Banyek: Issues with purgeUnusedProjects.php cron job on mwmaint1002 (Fri Oct 26) - https://phabricator.wikimedia.org/T208231 (10Banyek) i'd like to add the owner of the script as a subscriber, but I don't know how to find who is it [17:05:34] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] "Are there already entries in the database that must be updated accordingly? Please make sure this happens the same time this is merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477856 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm) [17:08:25] (03PS2) 10Bmansurov: Labs: enable reader trust survey on enwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478028 (https://phabricator.wikimedia.org/T209882) [17:09:26] (03CR) 10Gehel: "PCC looks reasonable (https://puppet-compiler.wmflabs.org/compiler1002/13854/), checking a few servers (both elastic and logstash) it look" [puppet] - 10https://gerrit.wikimedia.org/r/478029 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel) [17:09:47] (03PS1) 10Volans: Add ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/478030 (https://phabricator.wikimedia.org/T205884) [17:11:46] !log uploaded nodejs 6.11~dfsg-1+wmf5 for jessie-wikimedia (the upstream patch for CVE-2018-12122 had a regression, this update fixes it) [17:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:03] (03PS7) 10Paladox: profile::phabricator::httpd: Fix worker configs and also use hiera value [puppet] - 10https://gerrit.wikimedia.org/r/477925 [17:15:38] (03PS4) 10Paladox: httpd::mpm: Add php7.0 and php7.2 [puppet] - 10https://gerrit.wikimedia.org/r/477587 [17:15:40] (03PS6) 10Paladox: phabricator: Increase 'max_execution_time' to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595 [17:16:59] (03PS1) 10Paladox: phabricator: Enable php-fpm on phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/478032 [17:17:28] (03PS2) 10Paladox: phabricator: 
Enable php-fpm on phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/478032 [17:17:54] (03PS3) 10Paladox: phabricator: Enable php-fpm on phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353) [17:17:56] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353) (owner: 10Paladox) [17:18:15] (03CR) 10DCausse: [C: 031] elasticsearch: purge main elasticsearch configuration directory [puppet] - 10https://gerrit.wikimedia.org/r/478029 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel) [17:31:44] !log installing nodejs updates on proton* [17:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:23] (03PS1) 10Shreyasminocha: Add HD logos for 3 projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478034 [17:45:32] (03PS1) 10Shreyasminocha: Update settings to include new HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478036 [17:46:24] (03CR) 10jerkins-bot: [V: 04-1] Update settings to include new HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478036 (owner: 10Shreyasminocha) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T1800). 
[18:00:18] no parsoid deploy today [18:13:54] PROBLEM - MariaDB Slave SQL: s8 on db1124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.pagelinks: Cant find record in pagelinks, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1087-bin.003545, end_log_pos 943920774 [18:17:11] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@1dba3cd]: Internally promisify page processing steps (T202642) [18:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:15] T202642: Investigate how to fix the performance problems caused by CPU bound work on the MCS services - https://phabricator.wikimedia.org/T202642 [18:18:09] (03CR) 10Elukey: "@Bstorm: any concern from the labsdb side?" [puppet] - 10https://gerrit.wikimedia.org/r/478020 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [18:18:25] (03CR) 10Elukey: "s/labsdb/labstore/" [puppet] - 10https://gerrit.wikimedia.org/r/478020 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [18:19:09] 10Operations, 10Phabricator, 10Patch-For-Review: Switch PHP-FPM on phab1002 - https://phabricator.wikimedia.org/T211353 (10Dzahn) a:03Dzahn [18:19:18] 10Operations, 10Phabricator, 10Patch-For-Review: Switch PHP-FPM on phab1002 - https://phabricator.wikimedia.org/T211353 (10Dzahn) p:05Triage>03Normal [18:21:01] (03CR) 10Gehel: [C: 032] elasticsearch: purge main elasticsearch configuration directory [puppet] - 10https://gerrit.wikimedia.org/r/478029 (https://phabricator.wikimedia.org/T211023) (owner: 10Gehel) [18:21:04] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@1dba3cd]: Internally promisify page processing steps (T202642) (duration: 03m 54s) [18:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:36] PROBLEM - MariaDB Slave Lag: s8 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 
765.51 seconds [18:29:22] (03PS1) 10Bstorm: gridengine: simplifying the config and making it more "normal" for grid [puppet] - 10https://gerrit.wikimedia.org/r/478041 (https://phabricator.wikimedia.org/T211258) [18:31:05] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) ehm. i spent time on puppetizing this to make sure bmansurov's import script gets installed in a way that doesn't conflict with server access po... [18:31:15] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Gehel) a:03Gehel [18:32:18] (03CR) 10Bstorm: [C: 032] gridengine: simplifying the config and making it more "normal" for grid [puppet] - 10https://gerrit.wikimedia.org/r/478041 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [18:34:54] (03PS2) 10Dzahn: Partman: Add logstash200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/477923 (https://phabricator.wikimedia.org/T211065) (owner: 10Papaul) [18:36:48] (03CR) 10Dzahn: [C: 032] Partman: Add logstash200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/477923 (https://phabricator.wikimedia.org/T211065) (owner: 10Papaul) [18:41:41] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Set FileImporter config help location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477798 (https://phabricator.wikimedia.org/T199108) (owner: 10WMDE-Fisch) [18:42:44] (03PS2) 10Dzahn: DHCP: Add MAC address entries for logstash200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/477911 (https://phabricator.wikimedia.org/T211065) (owner: 10Papaul) [18:46:15] (03CR) 10Dzahn: [C: 032] "for the ones wondering about the different vendor ID in that one MAC." 
[puppet] - 10https://gerrit.wikimedia.org/r/477911 (https://phabricator.wikimedia.org/T211065) (owner: 10Papaul) [18:47:41] (03PS2) 10Dzahn: DNS: Add production and mgmt DNS entries for logstash200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/477868 (https://phabricator.wikimedia.org/T211065) (owner: 10Papaul) [18:51:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10faidon) [18:53:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10faidon) Per @bd808 on IRC: labstore1003 is still in use, blocked by T209527. labstore100[12] are not in use at the mom... [18:53:47] (03CR) 10Dzahn: [C: 032] "also checked matching WMF numbers in netbox" [dns] - 10https://gerrit.wikimedia.org/r/477868 (https://phabricator.wikimedia.org/T211065) (owner: 10Papaul) [18:57:28] (03PS4) 10Gehel: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe) [18:57:40] (03CR) 10Gehel: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe) [19:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181206T1900). [19:00:04] bmansurov: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:11] here [19:04:55] Who's deploying? [19:07:10] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof can anyone deploy? [19:07:50] I can do it [19:08:00] thanks [19:08:52] 10Operations, 10ops-eqsin: update PDUs for eqsin (asset tag and other info) - https://phabricator.wikimedia.org/T211368 (10RobH) p:05Triage>03Low [19:09:10] (03CR) 10Catrope: [C: 032] Labs: enable reader trust survey on enwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478028 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [19:10:15] (03Merged) 10jenkins-bot: Labs: enable reader trust survey on enwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478028 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [19:10:54] (03PS4) 10Cwhite: prometheus: add directory size collector [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) [19:11:48] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add directory size collector [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) (owner: 10Cwhite) [19:14:39] (03PS5) 10Cwhite: prometheus: add directory size collector [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) [19:18:21] (03CR) 10jenkins-bot: Labs: enable reader trust survey on enwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478028 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [19:23:49] (03CR) 10DCausse: [C: 031] elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe) [19:26:29] (03PS5) 10Gehel: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe) [19:31:44] jouncebot: next 
[19:31:45] In 4 hour(s) and 28 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181207T0000) [19:31:54] (03CR) 10Cwhite: prometheus: add directory size collector (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) (owner: 10Cwhite) [19:32:31] (03CR) 10Cwhite: [C: 031] Remove Diamond from DNS roles [puppet] - 10https://gerrit.wikimedia.org/r/478016 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [19:32:53] !log shutting down elasticsearch on elastic2001-2024 (third time is a charm) - T211023 [19:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:57] T211023: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 [19:33:09] 10Operations, 10ops-eqiad, 10Analytics, 10cloud-services-team: Degraded RAID on cloudvirtan1001 - https://phabricator.wikimedia.org/T211235 (10Ottomata) 05Open>03Resolved a:03Ottomata Assuming this was caused by @andrewbogott reformatting the hosts. Closing. [19:35:46] (03CR) 10Ottomata: [C: 031] profile::statistics::private: allow labsdb to push nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/478022 (https://phabricator.wikimedia.org/T211330) (owner: 10Elukey) [19:35:57] (03CR) 10Gehel: [C: 032] elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe) [19:36:51] RoanKattouw: thanks for deploying. The change's working. 
[19:37:03] (03PS2) 10Cwhite: wmcs: add prometheus-memcached-exporter [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) [19:38:30] (03CR) 10Cwhite: "> I think there's been some general confusion where the metrics are" [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) (owner: 10Cwhite) [19:39:58] (03CR) 10Dzahn: [C: 032] "ack, 10 seemed too aggressive, 30 seems more default. this only influences future stretch system, not prod" [puppet] - 10https://gerrit.wikimedia.org/r/477595 (owner: 10Paladox) [19:40:29] (03PS7) 10Dzahn: phabricator: Increase 'max_execution_time' to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595 (owner: 10Paladox) [19:41:07] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Ottomata) Yah thanks for that @dzahn! The problem is a larger one: how should people get data out of analytics systems for production usage. The sear... [19:42:10] (03PS1) 10Herron: rsyslog: increase omkafka timeout to 10s [puppet] - 10https://gerrit.wikimedia.org/r/478045 (https://phabricator.wikimedia.org/T206633) [19:43:41] (03CR) 10Ottomata: "I think we should keep schema. If for some reason we ever put other types of events in here, it will be nice to be able to filter them." [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [19:44:10] PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:44:53] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10RStallman-legalteam) To update: switched to MOU & NDA which are now signed and filed with lega... 
[19:45:39] (03PS2) 10Herron: rsyslog: increase omkafka timeout to 10s [puppet] - 10https://gerrit.wikimedia.org/r/478045 (https://phabricator.wikimedia.org/T206633) [19:46:42] (03CR) 10Herron: [C: 032] rsyslog: increase omkafka timeout to 10s [puppet] - 10https://gerrit.wikimedia.org/r/478045 (https://phabricator.wikimedia.org/T206633) (owner: 10Herron) [19:47:24] (03Abandoned) 10Andrew Bogott: no-op patch for tox testing purposes [software/cumin] - 10https://gerrit.wikimedia.org/r/478026 (owner: 10Andrew Bogott) [19:47:52] (03PS4) 10Andrew Bogott: Openstack: support multiple regions [software/cumin] - 10https://gerrit.wikimedia.org/r/477811 (https://phabricator.wikimedia.org/T208861) [19:50:35] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:50:35] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:50:55] (03PS8) 10Dzahn: phabricator: Increase 'max_execution_time' to 30 [puppet] - 10https://gerrit.wikimedia.org/r/477595 (owner: 10Paladox) [19:51:28] herron: ^ do you know why these are already alerting? [19:51:41] i know they are brandnew, but this should only happen with a puppet role [19:51:47] PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:52:12] hmm [19:52:21] oh, sorry, wrong machines :) [19:52:25] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:52:25] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:52:40] (03PS5) 10Paladox: httpd::mpm: Add php7.0 and php7.2 [puppet] - 10https://gerrit.wikimedia.org/r/477587 [19:53:12] herron: nevermind, they are not what i thought it was . already from https://phabricator.wikimedia.org/T154251 [19:53:15] mutante: elastic is me [19:53:21] gehel: ok :) thanks [19:53:49] I missed a few downtimes, but nothing to worry so far [19:53:54] kk [19:53:56] ok [19:54:08] cleanup coming up [19:57:52] (03CR) 10Cwhite: [C: 032] initial commit (035 comments) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066) (owner: 10Cwhite) [19:59:55] (03PS1) 10Gehel: elasticsearch: force deletion of unmanaged resources in /etc/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/478048 [20:00:28] (03PS2) 10Gehel: elasticsearch: force deletion of unmanaged resources in /etc/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/478048 [20:00:35] (03CR) 10Dzahn: profile::phabricator::httpd: Fix worker configs and also use hiera value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477925 (owner: 10Paladox) [20:01:26] (03CR) 10DCausse: [C: 031] elasticsearch: force deletion of unmanaged resources in /etc/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/478048 (owner: 10Gehel) [20:01:44] (03CR) 10Gehel: [C: 032] elasticsearch: force deletion of unmanaged resources in /etc/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/478048 (owner: 10Gehel) [20:04:48] (03PS8) 10Paladox: profile::phabricator::httpd: Use hiera value [puppet] - 
10https://gerrit.wikimedia.org/r/477925 [20:05:52] (03PS2) 10Shreyasminocha: Update settings to include new HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478036 [20:07:16] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478034 (owner: 10Shreyasminocha) [20:07:21] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478036 (owner: 10Shreyasminocha) [20:11:16] (03PS1) 10Paladox: profile::phabricator::httpd: Update's worker config to match MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/478052 [20:11:31] (03PS9) 10Dzahn: phabricator: Use hiera value for phabricator_enable_php_fpm [puppet] - 10https://gerrit.wikimedia.org/r/477925 (owner: 10Paladox) [20:11:38] (03PS10) 10Dzahn: phabricator: Use hiera value for phabricator_enable_php_fpm [puppet] - 10https://gerrit.wikimedia.org/r/477925 (owner: 10Paladox) [20:11:59] (03CR) 10jerkins-bot: [V: 04-1] profile::phabricator::httpd: Update's worker config to match MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox) [20:15:15] RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational [20:15:35] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational [20:15:53] 10Operations, 10ops-eqiad, 10media-storage, 10Patch-For-Review: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (10Cmjohnson) [20:16:24] 10Operations, 10ops-eqiad, 10media-storage, 10Patch-For-Review: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (10Cmjohnson) @robh this is already assigned to you but these are ready for you to take over [20:16:32] RECOVERY - Elasticsearch HTTPS for production-search-omega-codfw on elastic2025 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1719 days) [20:16:32] RECOVERY - Elasticsearch HTTPS 
for production-search-omega-codfw on elastic2034 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1719 days) [20:17:41] (03CR) 10Dzahn: [C: 032] "noop https://puppet-compiler.wmflabs.org/compiler1002/13856/" [puppet] - 10https://gerrit.wikimedia.org/r/477925 (owner: 10Paladox) [20:19:57] (03CR) 10Dzahn: "looks more like "remove mod_php" than adding something. please edit the commit message a bit to explain why this is needed" [puppet] - 10https://gerrit.wikimedia.org/r/477587 (owner: 10Paladox) [20:21:00] (03PS6) 10Paladox: httpd::mpm: Add php7.0 and php7.2 [puppet] - 10https://gerrit.wikimedia.org/r/477587 [20:21:09] (03PS3) 10Stella: Add several HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477960 (https://phabricator.wikimedia.org/T150618) [20:21:32] (03CR) 10Dzahn: "i think i prefer we first do the switch to phab1002 as production host without doing this and then do this in a second step" [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353) (owner: 10Paladox) [20:21:59] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @GTirloni This has been an ongoing thing since August, I have replaced the battery 3 maybe 4 times already. Replaced the raid controller once and replaced 4 SS... [20:23:58] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson) @marostegui and all, the system board that was replaced yesterday was faulty. Showing errors on DIMM slots B4 and B1. After swapping DIMMs in B with DIMMs in A, the error remained B4... 
[20:25:28] RECOVERY - Elasticsearch HTTPS for production-search-omega-codfw on elastic2052 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1719 days)
[20:27:18] RECOVERY - Elasticsearch HTTPS for production-search-omega-codfw on elastic2031 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1719 days)
[20:27:18] (PS3) Herron: logstash: ship kafka server logs to ELK [puppet] - https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788)
[20:28:19] Operations, ops-eqiad, Cloud-Services, DC-Ops: labvirt1018 -> cloudvirt1018: update physical label, network port description, netbox - https://phabricator.wikimedia.org/T207319 (Cmjohnson) Open>Resolved
[20:28:34] (CR) Herron: [C: 2] logstash: ship kafka server logs to ELK [puppet] - https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788) (owner: Herron)
[20:29:24] Operations, Discovery-Search (Current work), Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (Gehel)
[20:29:34] (PS1) Stella: Updated InitialiseSettings.php for HD logos [mediawiki-config] - https://gerrit.wikimedia.org/r/478055 (https://phabricator.wikimedia.org/T150618)
[20:30:44] Operations, Discovery-Search (Current work), Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (Gehel) a:Gehel>RobH elastic2001-2024 are ready for decommission. They are taken out of the cluster and can be shut down whenever you want (cc @Papaul)
[20:31:24] Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (Andrew) @Cmjohnson am I correct in understanding that cloudvirt1020 has the exact same issue? Or has that been resolved somehow?
[20:31:38] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10danstillman) > Have you looked at Domino We looked at Domino briefly and found some [alarming parsing probl... [20:33:57] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) Ok, thanks @Ottomata feel free to just hit "restore" on that and apply it on another host once we get there. [20:34:34] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @andrew yes, you are correct it is the same exact issue. My goal was to work with one, figure out the issue and then go to HPE with a solution but that obviously... [20:36:19] (03PS2) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052 [20:36:24] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox) [20:39:47] (03PS1) 10GTirloni: cloudvirt1019: reimage with Stretch [puppet] - 10https://gerrit.wikimedia.org/r/478058 (https://phabricator.wikimedia.org/T196507) [20:41:12] (03PS3) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052 [20:41:17] (03CR) 10GTirloni: [C: 032] cloudvirt1019: reimage with Stretch [puppet] - 10https://gerrit.wikimedia.org/r/478058 (https://phabricator.wikimedia.org/T196507) (owner: 10GTirloni) [20:41:19] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox) [20:42:11] (03CR) 10jerkins-bot: [V: 04-1] profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox) [20:42:52] PROBLEM - Packet loss ratio for UDP on 
logstash1007 is CRITICAL: 0.5277 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:42:54] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) >>! In T208622#4804221, @Ottomata wrote: > YI think it will involve custom and locked down rsync modules, but we need to puppetize that somehow... [20:42:54] (03PS4) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052 [20:42:58] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox) [20:45:05] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Andrew) >>! In T196507#4804320, @Cmjohnson wrote: > @andrew yes, you are correct it is the same exact issue. My goal was to work with one, figure out... [20:45:36] !log remove codfw/eqdfw avoid path - T194542 [20:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:55] (03PS7) 10Paladox: httpd::mpm: Also remove php7.0 and php7.2 if not prefork [puppet] - 10https://gerrit.wikimedia.org/r/477587 [20:47:56] (03PS8) 10Paladox: httpd::mpm: Also remove mod_php for 7.0 and 7.2 if not prefork [puppet] - 10https://gerrit.wikimedia.org/r/477587 [20:48:30] (03PS9) 10Paladox: httpd::mpm: Also remove mod_php for 7.0 and 7.2 if not prefork [puppet] - 10https://gerrit.wikimedia.org/r/477587 (https://phabricator.wikimedia.org/T208257) [20:48:39] (03PS10) 10Paladox: httpd::mpm: Also remove mod_php for 7.0 and 7.2 if not prefork [puppet] - 10https://gerrit.wikimedia.org/r/477587 (https://phabricator.wikimedia.org/T208257) [20:48:47] !log reimaging cloudvirt1019 with stretch T196507 [20:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:51] T196507: Degraded RAID on cloudvirt1019 - 
https://phabricator.wikimedia.org/T196507 [20:50:32] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.3533 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:51:16] !log remove 2 eqiad avoid path - T194542 [20:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:42] (03PS5) 10Gehel: elasticsearch: configure LVS endpoint for new codfw clusters [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) [20:52:58] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2751 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:54:07] gehel: do we have to care about the above ^ ? [20:54:38] (03CR) 10Dzahn: [C: 031] httpd::mpm: Also remove mod_php for 7.0 and 7.2 if not prefork [puppet] - 10https://gerrit.wikimedia.org/r/477587 (https://phabricator.wikimedia.org/T208257) (owner: 10Paladox) [20:54:48] XioNoX: actually, probably yes (cc herron) [20:55:05] gehel: there seems to be a big uptake on kafka syslogs [20:55:17] I think I know why [20:55:20] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.5613 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:55:35] good :) [20:55:39] XioNoX: it is probably an indication that some service is spewing more logs than usual [20:56:10] (03PS5) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052 [20:56:11] some service feeling alone and needing to talk? 
[20:56:12] (03CR) 10Dzahn: [C: 031] "makes sense to me but since it's a global httpd module change that can influence a lot i just do +1 for now" [puppet] - 10https://gerrit.wikimedia.org/r/477587 (https://phabricator.wikimedia.org/T208257) (owner: 10Paladox) [20:56:37] kafka seems to log at NOTICE level, that's probably higher than we want (cc ottomata) [20:56:44] (03CR) 10DCausse: [C: 031] elasticsearch: configure LVS endpoint for new codfw clusters [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) (owner: 10Gehel) [20:57:11] herron: if you know, please tell! I'm curious! [20:57:12] I’ll revert this last patch, rsyslog is complaining about the length of the lines and needs maxmessagesize bumped [20:57:39] which afaik must be first in the config, so will have to think about how to accomplish that [20:58:02] (03PS1) 10Herron: Revert "logstash: ship kafka server logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/478060 [20:58:09] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@cbe4551]: Install new Updater with INSERT DATA [20:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:44] (03CR) 10Herron: [C: 032] "reverting because rsyslog default max message size is not large enough to handle these, causing a flood of errors logged by rsyslogd" [puppet] - 10https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788) (owner: 10Herron) [20:59:09] (03CR) 10Herron: [C: 032] Revert "logstash: ship kafka server logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/478060 (owner: 10Herron) [20:59:17] (03PS2) 10Herron: Revert "logstash: ship kafka server logs to ELK" [puppet] - 10https://gerrit.wikimedia.org/r/478060 [21:00:06] herron: kafka still seems to log a lot, 500K messages in the last 15 minutes. 
I'm pinging people in -analytics [21:00:22] !log remove 2 esams avoid path + 4 prefered/selected transits - T194542 [21:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:21] gehel: how do those messages look? cause event validation errors are being log and 500K in 15 mins sounds very possible [21:02:45] (03CR) 10Dzahn: [C: 04-1] profile::phabricator::httpd: Update's worker config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox) [21:02:49] nuria: logstash now refuses to show me those message :( [21:02:56] * gehel is probably doing something wrong [21:03:44] nuria: for example: `[2018-07-11 19:44:04,942] INFO [ReplicaFetcher replicaId=1001, leaderId=1003, fetcherId=2] Retrying leaderEpoch request for partition eqiad.change-prop.retry.cpjobqueue.retry.mediawiki.job.LocalPageMoveJob-0 as the leader reported an error: UNKNOWN_SERVER_ERROR (kafka.server.ReplicaFetcherThread)` [21:03:50] nuria: eventlogging-error rate seems 4 or 5 per second - A lot less than 500k for 15mins [21:04:18] (03PS1) 10CDanis: grafana-beta.wikimedia.org: add hiera for text varnishes [puppet] - 10https://gerrit.wikimedia.org/r/478062 (https://phabricator.wikimedia.org/T210416) [21:04:23] (03PS1) 10Bstorm: sonofgridengine: point to the actual executable for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/478063 (https://phabricator.wikimedia.org/T211258) [21:04:26] gehel: ok, that seems indeed like too verbose logging going there [21:04:54] (03PS6) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052 [21:05:05] also, it looks like there is a date in the message itself, and it does not match the log event date [21:05:10] joal: it's something like [2018-11-29 15:19:18,236] INFO Deleted offset index \/srv\/kafka\/data\/webrequest_text-5\/00000000059726012114.index.deleted. 
(kafka.log.LogSegment) [21:05:27] (03CR) 10Paladox: profile::phabricator::httpd: Update's worker config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/478052 (owner: 10Paladox) [21:05:41] 5k lines per second [21:06:11] dcausse: Seems related to data deletion - 7 days of retention --> 2018-11-29 [21:06:14] Oh, it looks like those messages go through syslog, so I assume they flow through systemd [21:06:23] (03CR) 10Urbanecm: [C: 04-1] "Have to change the review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478034 (owner: 10Shreyasminocha) [21:06:39] so there is probably some lag and duplicated heard and wrong parsing [21:06:54] gehel: these were coming from https://gerrit.wikimedia.org/r/476982 which is using imfile [21:06:59] (03PS2) 10Bstorm: sonofgridengine: point to the actual executable for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/478063 (https://phabricator.wikimedia.org/T211258) [21:06:59] gehel, dcausse : *I think* that too sounds like verbose logging. We do not use logstash to alarm on any kafka functionality at all. Let's disable logging to logstash completely if it is a nuisance [21:07:08] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [21:07:08] (03PS7) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052 [21:07:14] there isn’t parsing on the message contents, but we can add it [21:07:27] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@cbe4551]: Install new Updater with INSERT DATA (duration: 09m 18s) [21:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:43] herron: I'm wondering if going through journald isn't more pain than it is worth [21:08:00] in terms of extracting fields on those log types.
they are going rsyslog -> kafka -> logstash [21:08:12] multi-line logs make it tricky [21:08:33] gehel: ottomata is not here but i think until tomorrow we can do w/o any logstash logging, we really do not use it at all [21:08:34] those messages are structured in whatever logging framework kafka uses, so serializing to text and running grok on that seems more error prone [21:09:14] herron: we should at least serialize that to json instead of text [21:09:49] herron / nuria: do you know how to disable the kafka logs for the time being? [21:10:16] gehel: looking, this is the ticket to enable them, give me a sec: https://phabricator.wikimedia.org/T205437 [21:10:18] they are, reverted the change a short while ago and it’s propagating out [21:10:31] herron: thanks! [21:10:32] I no longer see those in logstash [21:10:32] they are disabled that is [21:11:08] (03CR) 1020after4: [C: 031] phabricator: Enable php-fpm on phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353) (owner: 10Paladox) [21:11:10] once the root cause is solved, shouldn't a new icinga/grafana check be added to catch the issue earlier instead of relying on UDP packet drops? [21:11:26] (03PS8) 10Paladox: profile::phabricator::httpd: Update's worker config [puppet] - 10https://gerrit.wikimedia.org/r/478052 [21:12:01] XioNoX: the rsyslog line length issue? [21:12:02] (03PS4) 10Paladox: phabricator: Enable php-fpm on phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353) [21:12:07] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353) (owner: 10Paladox) [21:12:27] (03CR) 10BBlack: [C: 031] "Looks sane to a human!" [puppet] - 10https://gerrit.wikimedia.org/r/478062 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [21:12:42] herron: can you point me to the change you reverted?
[21:12:58] no idea, I'm wondering if there is a way to catch similar issues in the future before they cause packet drops [21:13:31] nuria: sure, it was https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476982/ [21:14:15] (03PS1) 10CDanis: grafana-beta.wikimedia.org: add DNS entry for text varnishes [dns] - 10https://gerrit.wikimedia.org/r/478067 (https://phabricator.wikimedia.org/T210416) [21:15:13] (03PS2) 10CDanis: grafana-beta.wikimedia.org: add hiera for text varnishes [puppet] - 10https://gerrit.wikimedia.org/r/478062 (https://phabricator.wikimedia.org/T210416) [21:15:43] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10ayounsi) [21:16:02] (03CR) 10CDanis: [C: 032] grafana-beta.wikimedia.org: add hiera for text varnishes [puppet] - 10https://gerrit.wikimedia.org/r/478062 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [21:16:12] herron: thank you, have reopened our original ticket [21:19:21] np! [21:19:30] XioNoX: in a nutshell this is the reason for migrating to the kafka logging pipeline [21:19:49] cool :) [21:19:49] when complete we’ll be able to turn down udp [21:20:13] as long as you don't drop my precious packets :) [21:20:16] and there will be much celebration [21:20:17] hah [21:20:28] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477960 (https://phabricator.wikimedia.org/T150618) (owner: 10Stella) [21:20:56] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478055 (https://phabricator.wikimedia.org/T150618) (owner: 10Stella) [21:21:18] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Ottomata) Oh that is cool, thanks!
> > > [21:21:59] (03CR) 10jerkins-bot: [V: 04-1] Updated InitialiseSettings.php for HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478055 (https://phabricator.wikimedia.org/T150618) (owner: 10Stella) [21:23:32] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [21:27:16] (03CR) 10BBlack: [C: 031] grafana-beta.wikimedia.org: add DNS entry for text varnishes [dns] - 10https://gerrit.wikimedia.org/r/478067 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [21:27:28] (03PS3) 10Bstorm: sonofgridengine: point to the actual executable for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/478063 (https://phabricator.wikimedia.org/T211258) [21:28:05] (03CR) 10CDanis: [C: 032] grafana-beta.wikimedia.org: add DNS entry for text varnishes [dns] - 10https://gerrit.wikimedia.org/r/478067 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [21:29:16] (03CR) 10Bstorm: [C: 032] sonofgridengine: point to the actual executable for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/478063 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [21:29:25] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2001.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:29:43] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2002.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:30:08] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10RobH) >>! 
In T211023#4804503, @ops-monitoring-bot wrote: > wmf-decommission-host was executed by robh for elastic2002.codfw.wmnet and performed the following actio... [21:30:36] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10GTirloni) Stretch did not help, battery continues showing as recharging. ` Smart Array P440ar in Slot 0 (Embedded) Cache Serial Number: PDNLH0BRH8... [21:30:42] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2003.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:30:56] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2004.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:31:07] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2005.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:31:22] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2006.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... 
[21:31:41] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2007.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:32:36] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@f675fcc]: Added performer to the revision-scores event [21:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:43] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [21:32:43] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2008.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:32:53] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2009.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:33:43] (03PS1) 10GTirloni: cloudvirt1019: reimage with Jessie [puppet] - 10https://gerrit.wikimedia.org/r/478098 (https://phabricator.wikimedia.org/T196507) [21:33:47] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2010.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... 
[21:33:51] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@f675fcc]: Added performer to the revision-scores event (duration: 01m 15s) [21:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:57] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2011.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:34:07] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2012.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:34:20] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2013.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:34:28] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2014.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... 
[21:34:35] (03CR) 10GTirloni: [C: 032] cloudvirt1019: reimage with Jessie [puppet] - 10https://gerrit.wikimedia.org/r/478098 (https://phabricator.wikimedia.org/T196507) (owner: 10GTirloni) [21:34:37] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2015.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:34:45] 10Operations, 10ops-eqiad: eqiad: Re-connect cage cameras - https://phabricator.wikimedia.org/T207965 (10Cmjohnson) Some cameras have been re-connected as I am in their racks, others will need me to run new cables to reach the new switches. Some progress as I get the chance. Front of Cage Camera Rows A/B ->... [21:34:47] so yeah, that script is going to spam this channel 24 times total. [21:34:49] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2016.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:35:00] once per host. [21:35:01] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2017.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:35:10] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2018.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... 
[21:35:15] robh and you're going to be pinged 24 times :P [21:35:20] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2019.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:35:30] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2020.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:35:39] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [21:35:40] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2021.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:35:51] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2022.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:35:56] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Ottomata) @Banyek another Q: Can we add permissions to the recommendationapi user on m2-master to be able to connect from stat1007? This might not be...
[21:36:02] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2023.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:36:29] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for elastic2024.codfw.wmnet and performed the following actions: - Revoked Puppet certificate - Remo... [21:37:55] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [21:38:27] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Ottomata) For posterity, I did the following from neodymium for Baho: ` python3 deploy.py import_languages 20181130 m2-master.eqiad.wmnet 3306 recomme... [21:39:25] !log reimaging cloudvirt1019 with jessie T196507 [21:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:29] T196507: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 [21:40:27] (03PS1) 10CDanis: Temporarily override grafan1001's HTTP serving domain to grafana-beta.wikimedia.org. Once we are happy with the migration we can re-point grafana.w.o Varnishes to it and simply remove this file. 
[puppet] - 10https://gerrit.wikimedia.org/r/478099 (https://phabricator.wikimedia.org/T210416) [21:40:52] (03CR) 10Ottomata: [C: 032] EventLogging Logstash filter: move useful fields out of event [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [21:40:59] (03PS4) 10Ottomata: EventLogging Logstash filter: move useful fields out of event [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [21:41:02] (03CR) 10jerkins-bot: [V: 04-1] Temporarily override grafan1001's HTTP serving domain to grafana-beta.wikimedia.org. Once we are happy with the migration we can re-point grafana.w.o Varnishes to it and simply remove this file. [puppet] - 10https://gerrit.wikimedia.org/r/478099 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [21:41:15] 10Operations, 10SRE-Access-Requests: Requesting access to Proton for pmiazga, bearND, Mholloway, MSantos, Tgr - https://phabricator.wikimedia.org/T211382 (10pmiazga) [21:41:27] (03CR) 10Ottomata: [V: 032 C: 032] EventLogging Logstash filter: move useful fields out of event [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [21:41:52] (03PS2) 10CDanis: grafana1001: answer for grafana-beta.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/478099 (https://phabricator.wikimedia.org/T210416) [21:42:55] (03PS3) 10CDanis: grafana1001: answer for grafana-beta.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/478099 (https://phabricator.wikimedia.org/T210416) [21:43:22] 10Operations, 10SRE-Access-Requests: Requesting access to Proton for pmiazga, bearND, Mholloway, MSantos, Tgr - https://phabricator.wikimedia.org/T211382 (10pmiazga) [21:43:25] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10RobH) asw-a-codfw:ge-5/0/8 
elastic2001 asw-a-codfw:ge-5/0/20 elastic2002 asw-a-codfw:ge-5/0/21 elastic2003 asw-a-codfw:ge-8/0/3 elastic2004 asw-a-codfw:ge-8/0/4 el... [21:43:30] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [21:44:27] (03CR) 10ArielGlenn: [C: 031] "Um a quick note about the commit message, it's the labstore boxes (dumps web/nfs servers), not the labsdbs that are involved." [puppet] - 10https://gerrit.wikimedia.org/r/478022 (https://phabricator.wikimedia.org/T211330) (owner: 10Elukey) [21:44:39] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [21:45:09] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [21:45:37] 10Operations, 10SRE-Access-Requests: Requesting access to Proton for pmiazga, bearND, Mholloway, MSantos, Tgr - https://phabricator.wikimedia.org/T211382 (10pmiazga) @bearND @Mholloway @MSantos @Tgr could you edit the task and put your shell usernames here please? @mobrovac could you approve the request? Also... [21:50:15] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [21:50:32] 10Operations, 10Cloud-Services, 10Patch-For-Review: Setting up bulk proxies pointing to a multiwiki mediawiki-vagrant setup running on a labs vm - https://phabricator.wikimedia.org/T132216 (10bd808) 05Open>03Resolved a:03bd808 Closing this out. 2.5 years with no updates so... yeah. [21:50:36] (03CR) 10CDanis: [C: 032] grafana1001: answer for grafana-beta.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/478099 (https://phabricator.wikimedia.org/T210416) (owner: 10CDanis) [21:53:14] (03CR) 10Ottomata: [V: 032 C: 032] "Wow, my claim about the raw event sometimes being JSON was wrong. It shoudl be though! 
Fixing: https://gerrit.wikimedia.org/r/#/c/eventl" [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [21:56:33] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 82.23 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1panelId=2fullscreen [21:56:50] (03PS1) 10EBernhardson: Turn off wbsearchentities test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478103 (https://phabricator.wikimedia.org/T209402) [22:05:00] 10Operations, 10SRE-Access-Requests: Requesting access to Proton for pmiazga, bearND, Mholloway, MSantos, Tgr - https://phabricator.wikimedia.org/T211382 (10pmiazga) [22:09:49] 10Operations, 10Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10RobH) [22:11:02] 10Operations, 10SRE-Access-Requests: Requesting access to Proton for pmiazga, bearND, Mholloway, MSantos, Tgr - https://phabricator.wikimedia.org/T211382 (10Mholloway) [22:12:09] !log decommissioning cassandra-b, restbase2001 -- T210843 [22:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:13] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 [22:13:43] (03PS1) 10RobH: decom elastic2001-2024 [puppet] - 10https://gerrit.wikimedia.org/r/478105 (https://phabricator.wikimedia.org/T211023) [22:14:30] (03CR) 10RobH: [C: 032] decom elastic2001-2024 [puppet] - 10https://gerrit.wikimedia.org/r/478105 (https://phabricator.wikimedia.org/T211023) (owner: 10RobH) [22:15:15] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10RobH) [22:17:10] (03PS1) 10RobH: decom elastic2001-2024 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/478106 (https://phabricator.wikimedia.org/T211023) [22:17:49] (03CR) 10RobH: [C: 
032] decom elastic2001-2024 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/478106 (https://phabricator.wikimedia.org/T211023) (owner: 10RobH) [22:21:33] 10Operations, 10Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10RobH) [22:22:08] 10Operations, 10ops-codfw, 10Core Platform Team, 10Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (10Eevans) [22:23:28] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10RobH) a:05RobH>03Papaul Ok, these are now ready for SSD wipe. Please note, since they are SSDs, a wipe (write zeros) won't work, and the hdparm ut... [22:49:44] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10cscott) If you could provide more details, I'd certainly be interested in helping debug the XPath library in... 
[22:54:22] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10RobH) p:05Normal>03High [22:54:48] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10RobH) [22:55:05] (03PS2) 10Stella: Updated InitialiseSettings.php for HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478055 (https://phabricator.wikimedia.org/T150618) [22:58:45] !log ppchelko@deploy1001 Started deploy [restbase/deploy@be8f0c0]: Add 'morelike' recommendation public API specification T201192 [22:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:49] T201192: Build API to surface 'morelike' article recommendations for missing articles - https://phabricator.wikimedia.org/T201192 [23:06:19] (03PS1) 10Andrew Bogott: Openstack: monitor nova and kvm on cloudvirt hosts [puppet] - 10https://gerrit.wikimedia.org/r/478113 (https://phabricator.wikimedia.org/T211388) [23:10:49] (03PS1) 10Dzahn: interface: use new data type Stdlib::Ip_address [puppet] - 10https://gerrit.wikimedia.org/r/478114 [23:11:43] 10Operations, 10Traffic, 10Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (10RobH) So, as of December 6th, there are no new emails to EquinixMaintenance.SG@ap.equinix.com since our last email/update request to Vivian. Once... 
[23:12:48] (03PS2) 10Andrew Bogott: Openstack: monitor nova and kvm on cloudvirt hosts [puppet] - 10https://gerrit.wikimedia.org/r/478113 (https://phabricator.wikimedia.org/T211388) [23:12:50] (03PS1) 10Andrew Bogott: Disable alerting on cloudvirt1019 and 1020 [puppet] - 10https://gerrit.wikimedia.org/r/478115 (https://phabricator.wikimedia.org/T196507) [23:15:24] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478055 (https://phabricator.wikimedia.org/T150618) (owner: 10Stella) [23:16:40] (03CR) 10Andrew Bogott: [C: 032] Disable alerting on cloudvirt1019 and 1020 [puppet] - 10https://gerrit.wikimedia.org/r/478115 (https://phabricator.wikimedia.org/T196507) (owner: 10Andrew Bogott) [23:17:07] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478055 (https://phabricator.wikimedia.org/T150618) (owner: 10Stella) [23:21:32] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@be8f0c0]: Add 'morelike' recommendation public API specification T201192 (duration: 22m 46s) [23:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:36] T201192: Build API to surface 'morelike' article recommendations for missing articles - https://phabricator.wikimedia.org/T201192 [23:25:02] 10Operations, 10Recommendation-API, 10Research, 10Core Platform Team Kanban (Done with CPT), and 2 others: Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10bmansurov) [23:25:42] (03PS1) 10Dzahn: gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 [23:26:24] 10Operations, 10DBA, 10Research, 10Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (10bmansurov) [23:29:35] (03PS2) 10Dzahn: gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 [23:30:00] (03CR) 10Paladox: gerrit: add data types for all parameters (034 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/478116 (owner: 10Dzahn) [23:30:14] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10bmansurov) [23:37:08] (03PS3) 10Dzahn: gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 [23:37:10] (03CR) 10Dzahn: gerrit: add data types for all parameters (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/478116 (owner: 10Dzahn) [23:38:02] (03CR) 10jerkins-bot: [V: 04-1] gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 (owner: 10Dzahn) [23:38:13] 10Operations, 10monitoring, 10Patch-For-Review, 10User-CDanis: Upgrade grafana to 5.x - https://phabricator.wikimedia.org/T210416 (10CDanis) [23:43:43] (03CR) 10Dzahn: [C: 04-1] "the syntax errors should be gone after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475261/ needs to wait a bit.. don't want to" [puppet] - 10https://gerrit.wikimedia.org/r/478116 (owner: 10Dzahn) [23:45:45] !log ppchelko@deploy1001 Started deploy [recommendation-api/deploy@299b268]: Add 'morelike' article recommendations API T201192 [23:45:47] 10Operations, 10Icinga, 10Patch-For-Review: move icinga contacts file to public repo - https://phabricator.wikimedia.org/T164238 (10Dzahn) a:05Dzahn>03None [23:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:48] T201192: Build API to surface 'morelike' article recommendations for missing articles - https://phabricator.wikimedia.org/T201192 [23:47:03] (03Abandoned) 10Dzahn: icinga: test creating individual contact secrets [puppet] - 10https://gerrit.wikimedia.org/r/391980 (https://phabricator.wikimedia.org/T164238) (owner: 10Dzahn) [23:47:15] !log troubleshoot bird bfd on dns2001/cr1-codfw [23:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:51] !log ppchelko@deploy1001 Finished deploy 
[recommendation-api/deploy@299b268]: Add 'morelike' article recommendations API T201192 (duration: 02m 06s) [23:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:38] (03PS1) 10Dzahn: nagios_common: remove commented section about contacts test file [puppet] - 10https://gerrit.wikimedia.org/r/478118 (https://phabricator.wikimedia.org/T164238) [23:48:55] 10Operations, 10Icinga, 10monitoring: move icinga contacts file to public repo - https://phabricator.wikimedia.org/T164238 (10Dzahn) [23:49:34] (03CR) 10Dzahn: [C: 032] "just cleaning up what i added back in 2017 and is commented" [puppet] - 10https://gerrit.wikimedia.org/r/478118 (https://phabricator.wikimedia.org/T164238) (owner: 10Dzahn) [23:50:51] 10Operations, 10MediaWiki-Debug-Logger, 10Performance-Team: Set up request profiling for PHP 7 - https://phabricator.wikimedia.org/T206152 (10tstarling) Please install tideways, but it should only be enabled in php.ini on the debug servers, since it will cause a performance degradation even without being use... [23:53:15] (03PS2) 10Dzahn: ci::website: convert apache to httpd [puppet] - 10https://gerrit.wikimedia.org/r/453554 [23:58:06] ste1la: welcome, new contributor