[00:00:04] twentyafterfour: #bothumor I � Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T0000). [00:02:08] (PS8) Jeena Huneidi: Add Parsoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909) [00:03:11] (CR) Jeena Huneidi: Add Parsoid chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909) (owner: Jeena Huneidi) [00:04:05] (PS9) Jeena Huneidi: Add Parsoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909) [00:22:11] * Krinkle staging on mwdebug1002 [00:28:22] !log krinkle@deploy1001 sync-file aborted: composer.json composer.lock dblists debug.json docroot errorpages fc-list fonts images langlist langlist-labs multiversion php php-1.34.0-wmf.13 php-1.34.0-wmf.14 php-1.34.0-wmf.15 php-1.34.0-wmf.16 phpcs.xml phpunit.xml portals private README requirements.txt robots.txt rpc scap setup.py src static test-requirements.txt tests tox.ini typos vendor w wikiversions.json wikiversions-labs.js [00:28:22] fig List of module names that contain QUnit test suites (duration: 00m 01s) [00:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:32] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.16/includes/resourceloader/ResourceLoader.php: acfff6751f3b8f7650 (duration: 00m 55s) [00:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:01] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.16/includes/specials/SpecialJavaScriptTest.php: acfff6751f3b8f7650 (duration: 00m 54s) [00:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:08] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.16/resources/Resources.php: acfff6751f3b8f7650 (duration: 00m 54s) [00:33:14] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:17] train is presently still blocked on T229482. i'm signing off and will resume efforts to move forward at 08:00 MDT / 14:00 UTC. [00:36:18] T229482: PHP Warning: Wikibase\Lib\Store\Sql\WikiPageEntityRevisionLookup::getEntityRevision: Entity not loaded - https://phabricator.wikimedia.org/T229482 [00:50:51] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:10:15] (PS1) Ayounsi: Prometheus, collect Netbox metrics [puppet] - https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) [01:18:45] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:26:48] (PS2) Ayounsi: Prometheus, collect Netbox metrics [puppet] - https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) [01:33:03] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/17700/" [puppet] - https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: Ayounsi) [01:40:40] (CR) Ayounsi: "This doesn't work." [puppet] - https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: Ayounsi) [01:57:27] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:12:21] PROBLEM - puppet last run on wtp1032 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:25:27] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:40:19] RECOVERY - puppet last run on wtp1032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:59:03] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 50.21, 24.55, 13.98 https://wikitech.wikimedia.org/wiki/Application_servers [02:59:13] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 53.53, 25.34, 13.86 https://wikitech.wikimedia.org/wiki/Application_servers [02:59:25] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 56.21, 27.80, 15.18 https://wikitech.wikimedia.org/wiki/Application_servers [03:00:11] PROBLEM - Nginx local proxy to apache on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:00:41] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 13.21, 19.68, 13.37 https://wikitech.wikimedia.org/wiki/Application_servers [03:00:49] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20580336 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:00:53] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 18.52, 22.39, 14.07 https://wikitech.wikimedia.org/wiki/Application_servers [03:00:53] PROBLEM - HHVM rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:03] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 19.93, 23.87, 15.06 
https://wikitech.wikimedia.org/wiki/Application_servers [03:01:27] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out [03:01:27] was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{doma [03:01:27] m/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [03:01:33] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [03:01:33] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:01:35] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: 
/{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [03:01:35] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:01:35] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [03:01:35] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/ [03:01:35] itoring/recommendation_api [03:01:37] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:01:37] PROBLEM - recommendation_api endpoints health on 
scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response wa [03:01:37] ain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:01:37] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was receiv [03:01:38] article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:01:38] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received h [03:01:39] ikimedia.org/wiki/Services/Monitoring/mobileapps [03:01:45] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: 
/{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [03:01:45] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:01:45] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed o [03:01:46] nse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:01:46] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [03:01:46] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed 
out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/ [03:01:46] itoring/recommendation_api [03:01:57] PROBLEM - Nginx local proxy to apache on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:57] PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:57] PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:59] PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:59] PROBLEM - HHVM rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:02:05] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:09] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:02:09] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:02:09] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:02:13] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers 
[03:02:13] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:02:15] PROBLEM - HHVM rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:02:17] PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:02:19] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 73.36, 40.89, 22.90 https://wikitech.wikimedia.org/wiki/Application_servers [03:02:29] PROBLEM - Apache HTTP on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:02:39] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:02:39] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:02:47] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:03:01] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [03:03:09] RECOVERY - 
recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:03:09] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:11] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [03:03:11] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:03:13] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:03:13] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:03:15] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:19] RECOVERY - Nginx local proxy to apache on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 660 bytes in 0.234 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:19] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:03:19] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:03:25] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 658 bytes in 0.042 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [03:03:25] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 659 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:27] RECOVERY - Nginx local proxy to apache on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 1.574 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:29] RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 659 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:29] RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 76068 bytes in 1.393 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:31] PROBLEM - High CPU load on API appserver on mw1316 is CRITICAL: CRITICAL - load average: 78.77, 47.30, 28.27 https://wikitech.wikimedia.org/wiki/Application_servers [03:03:33] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 76114 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:37] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:37] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 76067 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:41] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:43] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:43] RECOVERY - HHVM rendering on mw1346 is OK: HTTP OK: 
HTTP/1.1 200 OK - 76067 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:45] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:45] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:57] RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:59] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 56.22, 30.20, 18.09 https://wikitech.wikimedia.org/wiki/Application_servers [03:04:01] RECOVERY - HHVM rendering on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 76114 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:04:05] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19283016 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:04:09] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 69.40, 35.61, 19.74 https://wikitech.wikimedia.org/wiki/Application_servers [03:04:11] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:04:13] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:04:19] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:04:33] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: 
Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:04:43] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:04:45] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 55.65, 33.97, 20.18 https://wikitech.wikimedia.org/wiki/Application_servers [03:04:45] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:04:55] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:05:05] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 64.86, 36.46, 20.50 https://wikitech.wikimedia.org/wiki/Application_servers [03:05:07] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 92137704 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:05:21] PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:05:33] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 13.33, 26.71, 20.76 https://wikitech.wikimedia.org/wiki/Application_servers [03:05:51] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:06:01] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:06:45] RECOVERY - High CPU load on API appserver on mw1316 is OK: OK - load average: 16.22, 32.03, 25.94 https://wikitech.wikimedia.org/wiki/Application_servers [03:07:11] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 12.92, 24.69, 18.59 https://wikitech.wikimedia.org/wiki/Application_servers [03:07:23] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 10.48, 23.75, 18.18 https://wikitech.wikimedia.org/wiki/Application_servers [03:07:37] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:07:53] PROBLEM - puppet last run on an-worker1078 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:07:59] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 11.42, 22.91, 18.46 https://wikitech.wikimedia.org/wiki/Application_servers [03:08:31] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 80427384 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:09:53] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 8.04, 20.70, 18.30 https://wikitech.wikimedia.org/wiki/Application_servers [03:09:55] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 124600 and 37 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:10:31] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 46624 and 73 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:13:25] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3104 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:14:31] (PS2) Ejegg: CSP for banner preview: allow remind me later host [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) [03:30:19] RECOVERY - puppet last run on an-worker1078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:32:33] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:33:19] RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:33:53] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:34:05] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [03:35:35] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [04:08:23] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [04:18:51] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:26:02] Operations, netops, observability: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824 (ayounsi) Good news, this is already implemented with: https://github.com/librenms/librenms/pull/9879 Bad news, for unknown reasons so far, the switches don't expose the proper interface data. For...
[04:36:29] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [04:44:46] Operations, ops-codfw, DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui) Thanks @Papaul I have started MySQL again, let's monitor the host for a few days [04:46:31] (CR) Marostegui: [C: +2] DNS: Remove DNS entires for db2042 [dns] - https://gerrit.wikimedia.org/r/526762 (owner: Papaul) [04:48:29] (CR) Marostegui: "Ooooh sweet! Thanks! :)" [software/conftool] - https://gerrit.wikimedia.org/r/526774 (owner: CDanis) [04:51:35] (PS1) Marostegui: db2058: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/526836 (https://phabricator.wikimedia.org/T229449) [04:52:11] Operations, ops-codfw, Patch-For-Review: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T229449 (Marostegui) As expected, controller failure: ` /system1/log1/record14 Targets Properties number=14 severity=Critical date=07/31/2019 time=16:51 description=Drive Array... [04:52:54] (CR) Marostegui: [C: +2] db2058: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/526836 (https://phabricator.wikimedia.org/T229449) (owner: Marostegui) [04:53:05] PROBLEM - puppet last run on dns5002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [04:54:34] Operations, DBA: db2058: Broken storage - https://phabricator.wikimedia.org/T229449 (Marostegui) [04:55:11] Operations, DBA: db2058: Broken storage - https://phabricator.wikimedia.org/T229449 (Marostegui) Open→Declined I am going to close this as this host will be decommissioned {T228258} [04:59:19] Operations, DBA, decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (Marostegui) [04:59:49] Operations, DBA, decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (Marostegui) p:Triage→Normal [04:59:58] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui) [05:04:25] (PS1) Marostegui: filtered_tables: Remove abuse_filter_log.afl_log_id [puppet] - https://gerrit.wikimedia.org/r/526837 (https://phabricator.wikimedia.org/T226851) [05:14:33] (CR) Marostegui: [C: +2] filtered_tables: Remove abuse_filter_log.afl_log_id [puppet] - https://gerrit.wikimedia.org/r/526837 (https://phabricator.wikimedia.org/T226851) (owner: Marostegui) [05:20:21] RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [05:21:07] RECOVERY - puppet last run on dns5002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [05:22:59] (PS1) Marostegui: db212[5-6]: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/526838 (https://phabricator.wikimedia.org/T228969) [05:24:50] (CR) Marostegui: [C: +2] db212[5-6]: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/526838
(https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [05:47:59] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1112" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526839 [05:52:55] (03CR) 10Giuseppe Lavagetto: "The code seems correct to me but I'm not sure we want every commit to have a message, esp if we're in the middle of a series of commits an" [software/conftool] - 10https://gerrit.wikimedia.org/r/526774 (owner: 10CDanis) [06:11:54] (03PS1) 10Vgutierrez: acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] - 10https://gerrit.wikimedia.org/r/526840 (https://phabricator.wikimedia.org/T229096) [06:17:37] (03PS7) 10Giuseppe Lavagetto: mediawiki: allow installing php7 only [puppet] - 10https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) [06:18:38] <_joe_> !log depooling mw1348 while moving it to no hhvm support. [06:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:56] !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1348.eqiad.wmnet [06:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: allow installing php7 only [puppet] - 10https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: 10Giuseppe Lavagetto) [06:21:24] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:22:22] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:28:34] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw1348.eqiad.wmnet 
[06:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:13] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/tmpreaper.conf] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:11] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:27] PROBLEM - puppet last run on db2115 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:35:31] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/tmpreaper.conf] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:36:10] (03PS1) 10Elukey: Remove Spark2 sasl config from the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/526843 (https://phabricator.wikimedia.org/T226698) [06:36:49] (03CR) 10Elukey: [C: 03+2] Remove Spark2 sasl config from the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/526843 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [06:40:37] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:40] (03PS2) 10Giuseppe Lavagetto: mediawiki: make mw1270 a php7-only application server [puppet] - 10https://gerrit.wikimedia.org/r/526720 [06:42:03] <_joe_> !log depooling mw1270 while migrating it to pure-php7 [06:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:21] !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1270.eqiad.wmnet [06:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: make mw1270 a php7-only application server [puppet] - 10https://gerrit.wikimedia.org/r/526720 (owner: 10Giuseppe Lavagetto) [06:48:20] (03PS1) 10Elukey: profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - 10https://gerrit.wikimedia.org/r/526849 [06:49:02] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Joe) [06:49:31] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Joe) [06:50:29] (03PS2) 10Elukey: 
profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - 10https://gerrit.wikimedia.org/r/526849 [06:50:45] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Joe) [06:51:44] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw1270.eqiad.wmnet [06:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:20] (03PS3) 10Elukey: profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - 10https://gerrit.wikimedia.org/r/526849 [06:58:30] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:58:44] RECOVERY - puppet last run on db2115 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:59:04] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:59:44] !log install python3-docopt manually on lithium to test check_anycast_healthchecker [06:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:40] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:04:28] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:07:00] 10Operations, 10netops: Instability of the Level3 link 
between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10elukey) The link went down again: ` elukey@re0.cr2-eqiad> show interfaces descriptions | match down xe-4/1/3 up down Transport: cr2-esams:xe-0/1/3 (Level3, BDFS2448,... [07:07:31] elukey: the router interface is the same you reported earlier? [07:07:33] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1112" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526839 (owner: 10Marostegui) [07:07:50] never mind, I should read more the timestamp around messages :D [07:07:53] * volans still waking up [07:08:25] volans: yep yep commented in security [07:08:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1112" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526839 (owner: 10Marostegui) [07:08:42] from the task it seems all good, Arzel drained the link [07:08:56] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1112" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526839 (owner: 10Marostegui) [07:09:20] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:09:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1112 (duration: 00m 54s) [07:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:23] !log marostegui@cumin1001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8843', previous config saved to /var/cache/conftool/dbconfig/20190801-071022-marostegui.json [07:10:24] (03PS4) 10Elukey: profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - 10https://gerrit.wikimedia.org/r/526849 [07:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:03] (03CR) 10Marostegui: "> The code seems correct to me but I'm not sure we want every commit" [software/conftool] - 
10https://gerrit.wikimedia.org/r/526774 (owner: 10CDanis) [07:13:15] (03PS5) 10Elukey: profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - 10https://gerrit.wikimedia.org/r/526849 [07:16:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:16:48] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:17:52] (03PS1) 10Marostegui: mariadb: Provision db2129 into s6 [puppet] - 10https://gerrit.wikimedia.org/r/526935 (https://phabricator.wikimedia.org/T228969) [07:18:58] (03CR) 10Marostegui: [C: 03+2] mariadb: Provision db2129 into s6 [puppet] - 10https://gerrit.wikimedia.org/r/526935 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [07:25:14] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10elukey) 05Resolved→03Open Couple of notes about the anycast-healthchecker: 1) the `anycast-healthchecker` is not in jessie-wikimedia, so puppet on lithium/wezen is currently broken: ` r... 
[07:25:30] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (10elukey) [07:25:32] ACKNOWLEDGEMENT - Check if anycast-healthchecker and all configured threads are running on lithium is CRITICAL: NRPE: Command check_anycast_healthchecker not defined Elukey T186550 https://wikitech.wikimedia.org/wiki/Anycast_recursive_DNS%23Anycast_healthchecker_not_running [07:27:35] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Provision db2126 into s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526944 (https://phabricator.wikimedia.org/T228969) [07:27:43] <_joe_> !log removing mw1348 from rotation - reimaging for T228976 [07:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:52] T228976: Allow to avoid installing HHVM from the mediawiki puppet module and profile - https://phabricator.wikimedia.org/T228976 [07:29:12] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,name=mw1348.eqiad.wmnet [07:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:54] 10Operations, 10serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10elukey) Added patch to the Debian bug in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763858#10 [07:31:52] (03CR) 10Volans: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526944 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [07:34:40] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Provision db2126 into s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526944 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [07:35:00] !log marostegui@cumin1001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8844', previous config saved to /var/cache/conftool/dbconfig/20190801-073459-marostegui.json [07:35:07] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:06] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Provision db2126 into s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526944 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [07:37:19] (03CR) 10Volans: [C: 03+1] "LGTM" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/526840 (https://phabricator.wikimedia.org/T229096) (owner: 10Vgutierrez) [07:37:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Provision db2126 into s2 T228969 (duration: 00m 54s) [07:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:40] T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969 [07:38:26] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Provision db2126 into s2 T228969 (duration: 00m 55s) [07:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:23] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Provision db2126 into s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526944 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [08:20:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add Parsoid chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909) (owner: 10Jeena Huneidi) [08:20:51] 10Operations, 10Security-Team, 10Traffic: Consider removing X-Wikimedia-Security-Audit VCL support - https://phabricator.wikimedia.org/T229320 (10ema) @dduvall: any reason not to proceed with the removal? [08:31:19] 10Operations, 10Traffic, 10Patch-For-Review, 10discovery-system, 10services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10Volans) @Joe any feedback on the above proposal? I'd really like to split the users ASAP given that dbctl is being deployed. 
[08:39:52] (03CR) 10Volans: "I'm ok with the UI of requiring the message and !logging for each write action to the mwconfig object read by MW." (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/526774 (owner: 10CDanis) [08:48:08] (03PS1) 10Filippo Giunchedi: site: lithium to spare [puppet] - 10https://gerrit.wikimedia.org/r/526980 (https://phabricator.wikimedia.org/T229557) [08:48:33] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done, decom for lithium is at {T229557} [08:53:51] (03PS4) 10Filippo Giunchedi: toil: add rsyslog TLS remedy [puppet] - 10https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) [09:00:38] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10aborrero) The dates you mention the WMCS team will be barely available because travel/wikimania/offsites, etc. Since the racks are "easy" for us, this shouldn't be a bloc... [09:02:44] (03CR) 10Marostegui: [C: 03+1] "Let's deploy carefully?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz) [09:02:46] (03PS5) 10Filippo Giunchedi: toil: add rsyslog TLS remedy [puppet] - 10https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) [09:02:52] (03CR) 10Volans: "Couple of comments/questions inline, looks good otherwise" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/521313 (owner: 10CRusnov) [09:08:42] 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10elukey) 05Resolved→03Open p:05High→03Normal [09:08:47] 10Operations, 10Analytics, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10elukey) [09:09:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10elukey) We didn't discuss if SERVICE UNKNOWN needs to alarm or not for some services :) [09:13:06] (03PS1) 10Urbanecm: flaggedrevs.php: Remove useless wgAddGroups/wgRemoveGroups declarations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527003 [09:15:26] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: 10Ayounsi) [09:16:11] 10Operations, 10Traffic: fifo-log-tailer: evergrowing memory usage - https://phabricator.wikimedia.org/T229414 (10ema) I've been digging a bit further and reproduced this on my workstation with the following program: `lang=go // growmem.go package main import ( "io/ioutil" "os" ) func main()... 
[09:17:37] (03PS6) 10Filippo Giunchedi: toil: add rsyslog TLS remedy [puppet] - 10https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) [09:19:50] (03CR) 10Alexandros Kosiaris: restrouter: Add helmfile stanzas (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/526719 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [09:19:54] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/17705/" [puppet] - 10https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [09:20:06] (03PS2) 10Alexandros Kosiaris: restrouter: Add helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/526719 (https://phabricator.wikimedia.org/T223953) [09:21:13] 10Operations, 10DBA: db2058: Broken storage - https://phabricator.wikimedia.org/T229449 (10Marostegui) I rebooted the server and this is the boot message: ` Slot 0 HP Smart Array P420i Controller (1 GB, v6.00) 1 Logical Drive 1719-Slot 0 Drive Array - A controller failure event occurred prior to this... 
[09:21:52] RECOVERY - MariaDB disk space on db2058 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [09:22:40] RECOVERY - Disk space on db2058 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2058&var-datasource=codfw+prometheus/ops [09:24:55] (03PS1) 10Vgutierrez: fifo-log-demux: Keep attempting to read the FIFO after EOF [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527013 [09:25:06] RECOVERY - HP RAID on db2058 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:29:52] jouncebot, next [09:29:52] In 1 hour(s) and 30 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1100) [09:34:46] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM overall, see comment re: metric names" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526782 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [09:35:34] RECOVERY - Check systemd state on db2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:20] RECOVERY - MariaDB Slave IO: s6 on db2058 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:41:16] RECOVERY - MariaDB Slave SQL: s6 on db2058 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:50:21] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/526791 (owner: 10Ayounsi) [09:51:09] (03PS2) 10Jbond: urbanecm's dotfiles: gitconfig: Add push-for-review, use SSH for pushing 
[puppet] - 10https://gerrit.wikimedia.org/r/526796 (owner: 10Urbanecm) [09:52:10] (03CR) 10Jbond: [C: 03+2] urbanecm's dotfiles: gitconfig: Add push-for-review, use SSH for pushing [puppet] - 10https://gerrit.wikimedia.org/r/526796 (owner: 10Urbanecm) [09:59:30] (03PS3) 10Alexandros Kosiaris: restrouter: Add helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/526719 (https://phabricator.wikimedia.org/T223953) [09:59:57] (03PS4) 10Alexandros Kosiaris: restrouter: Add kubernetes stanzas [puppet] - 10https://gerrit.wikimedia.org/r/526632 (https://phabricator.wikimedia.org/T223953) [10:04:11] (03CR) 10Jbond: "looks good, one nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526849 (owner: 10Elukey) [10:08:44] (03CR) 10Filippo Giunchedi: [C: 04-1] "Thanks for working on it! See inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526611 (https://phabricator.wikimedia.org/T229357) (owner: 10Elukey) [10:10:00] <_joe_> !log repooling mw1348 after reimaging as pure-php7 [10:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:23] !log rolling upgrade for patch [10:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:37] (03PS2) 10Lucas Werkmeister (WMDE): vcl: add Access-Control-Allow-Origin to mobile redirects [puppet] - 10https://gerrit.wikimedia.org/r/526627 (https://phabricator.wikimedia.org/T229385) [10:12:39] (03CR) 10Lucas Werkmeister (WMDE): vcl: add Access-Control-Allow-Origin to mobile redirects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526627 (https://phabricator.wikimedia.org/T229385) (owner: 10Lucas Werkmeister (WMDE)) [10:18:04] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10fgiunchedi) Thanks @elukey ! 
Indeed anycast-healthchecker isn't in jessie-wikimedia, lithium is being decom'd and if wezen gets reinstalled it'll be buster, and I installed anycast-healthche... [10:19:27] (03PS2) 10Vgutierrez: fifo-log-demux: Keep attempting to read the FIFO after EOF [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527013 [10:19:29] (03PS1) 10Vgutierrez: fifo-log-demux: Deprecate socket activation [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527039 [10:22:41] (03PS1) 10Jbond: mysql: remove grants for sarin and neodymium [puppet] - 10https://gerrit.wikimedia.org/r/527043 [10:24:19] (03PS2) 10Jbond: mysql: remove grants for sarin and neodymium [puppet] - 10https://gerrit.wikimedia.org/r/527043 (https://phabricator.wikimedia.org/T220503) [10:24:34] (03PS3) 10Jbond: mysql: remove grants for sarin and neodymium [puppet] - 10https://gerrit.wikimedia.org/r/527043 (https://phabricator.wikimedia.org/T220503) [10:29:33] (03PS2) 10Vgutierrez: fifo-log-demux: Remove socket activation [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527039 [10:29:35] (03PS3) 10Vgutierrez: fifo-log-demux: Keep attempting to read the FIFO after EOF [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527013 [10:30:35] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add the LBRemoteCluster class. (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [10:30:52] (03CR) 10Ema: [C: 03+1] "Looks good, as a reminder we should get rid of socket activation support from puppet too." [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527039 (owner: 10Vgutierrez) [10:32:34] (03PS15) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 [10:35:03] (03PS1) 10Ladsgroup: Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527052 [10:35:35] (03CR) 10Volans: [C: 03+1] "LGTM!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [10:36:33] (03CR) 10jerkins-bot: [V: 04-1] Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [10:39:02] (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527056 (https://phabricator.wikimedia.org/T228657) [10:40:41] (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527056 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [10:40:58] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2058 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527058 (https://phabricator.wikimedia.org/T229543) [10:46:19] (03CR) 10Volans: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527058 (https://phabricator.wikimedia.org/T229543) (owner: 10Marostegui) [10:47:14] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2058 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527058 (https://phabricator.wikimedia.org/T229543) (owner: 10Marostegui) [10:47:45] 10Operations, 10DBA, 10decommission, 10Patch-For-Review: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (10Marostegui) [10:48:15] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2058 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527058 (https://phabricator.wikimedia.org/T229543) (owner: 10Marostegui) [10:48:30] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2058 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527058 (https://phabricator.wikimedia.org/T229543) (owner: 10Marostegui) [10:50:13] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2058 from config T229543 (duration: 00m 55s) [10:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:21] T229543: Decommission 
db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 [10:50:25] (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527064 (https://phabricator.wikimedia.org/T228657) [10:51:02] (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527064 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [10:51:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2058 from config T229543 (duration: 00m 57s) [10:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:11] ACKNOWLEDGEMENT - puppet last run on bast4002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. John Bond Testing against new puppet master https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:58:13] (03PS16) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1100). [11:00:04] kart_, Urbanecm, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:21] o/ [11:00:22] I can SWAT today! [11:00:25] Urbanecm: I've wmf.16 patch, you can go ahead with your config patches first. [11:00:32] ack [11:00:38] Urbanecm: If you can do my patch, that's great. Already +2ed though. [11:00:51] kart_, if you want me to, happy to SWAT yours too! 
[11:01:02] Urbanecm: Please do :) [11:01:05] ok [11:01:12] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526492 (https://phabricator.wikimedia.org/T229346) (owner: 10Urbanecm) [11:02:13] (03Merged) 10jenkins-bot: flaggedrevs.php: Allow wikis to remove ability to promote to/demote from autoreview/editor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526492 (https://phabricator.wikimedia.org/T229346) (owner: 10Urbanecm) [11:02:30] (03CR) 10jenkins-bot: flaggedrevs.php: Allow wikis to remove ability to promote to/demote from autoreview/editor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526492 (https://phabricator.wikimedia.org/T229346) (owner: 10Urbanecm) [11:02:32] (03PS2) 10Urbanecm: flaggedrevs.php: Remove useless wgAddGroups/wgRemoveGroups declarations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527003 [11:02:36] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527003 (owner: 10Urbanecm) [11:02:53] pulled the merged patch onto mwdebug1002 [11:03:34] syncing [11:05:08] !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: SWAT: aa82657: flaggedrevs.php: Allow wikis to remove ability to promote to/demote from autoreview/editor (T229346) (duration: 00m 54s) [11:05:14] (03PS1) 10Marostegui: mariadb: Specify candidate masters [puppet] - 10https://gerrit.wikimedia.org/r/527073 [11:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:18] T229346: Administrators of the Hungarian Wikipedia have unapproved right - https://phabricator.wikimedia.org/T229346 [11:06:00] (03Merged) 10jenkins-bot: flaggedrevs.php: Remove useless wgAddGroups/wgRemoveGroups declarations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527003 (owner: 10Urbanecm) [11:07:52] (03PS1) 10Vgutierrez: fifo_log_demux: Remove socket activation [puppet] - 10https://gerrit.wikimedia.org/r/527075 [11:07:54] (03CR) 10jenkins-bot: flaggedrevs.php: 
Remove useless wgAddGroups/wgRemoveGroups declarations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527003 (owner: 10Urbanecm) [11:09:37] !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: SWAT: 7db98f3: flaggedrevs.php: Remove useless wgAddGroups/wgRemoveGroups declarations (duration: 00m 55s) [11:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:29] (03CR) 10Vgutierrez: "pcc seems happy almost showing a NOOP: https://puppet-compiler.wmflabs.org/compiler1002/17707/" [puppet] - 10https://gerrit.wikimedia.org/r/527075 (owner: 10Vgutierrez) [11:11:46] (03CR) 10Marostegui: "noop as expected: https://puppet-compiler.wmflabs.org/compiler1001/17708/" [puppet] - 10https://gerrit.wikimedia.org/r/527073 (owner: 10Marostegui) [11:11:48] (03CR) 10Marostegui: [C: 03+2] mariadb: Specify candidate masters [puppet] - 10https://gerrit.wikimedia.org/r/527073 (owner: 10Marostegui) [11:11:59] (03CR) 10Volans: [C: 03+1] "LGTM!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [11:12:04] kart_, Amir1: Done with my patches, waiting on CI for kart_'s backport. Amir1, do you want me to deploy your patch, or do you prefer deploying it yourself? [11:12:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [11:14:05] kart_, your backport is merged [11:14:06] processing it [11:14:51] kart_, your patch is on mwdebug1002 [11:15:11] cool. [11:15:22] Nothing to check. 
Go ahead :) [11:15:48] ok kart_ [11:16:16] (03PS1) 10Jbond: puppetmaster: Fix regression [puppet] - 10https://gerrit.wikimedia.org/r/527078 (https://phabricator.wikimedia.org/T228657) [11:16:51] (03PS2) 10Jbond: puppetmaster: Fix regression [puppet] - 10https://gerrit.wikimedia.org/r/527078 (https://phabricator.wikimedia.org/T228657) [11:16:54] Urbanecm: it would be great if you deploy it [11:17:49] Amir1, will do [11:17:52] (03CR) 10Jbond: [C: 03+2] puppetmaster: Fix regression [puppet] - 10https://gerrit.wikimedia.org/r/527078 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [11:18:11] (03Merged) 10jenkins-bot: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [11:19:06] (03PS2) 10Urbanecm: Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527052 (owner: 10Ladsgroup) [11:19:08] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/ExternalGuidance/: SWAT: 9402c36: Provide the messages in the target language of translation (T228019) (duration: 00m 56s) [11:19:12] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527052 (owner: 10Ladsgroup) [11:19:14] (03CR) 10jenkins-bot: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [11:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:16] T228019: Injected info does not get translated - https://phabricator.wikimedia.org/T228019 [11:19:36] Thanks! [11:19:43] Urbanecm: It's not testable [11:19:48] Amir1, ack [11:20:20] Thanks Urbanecm [11:20:22] yw kart_ [11:20:58] 10Operations, 10Traffic, 10Patch-For-Review, 10discovery-system, 10services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10Joe) >>! In T97972#5353056, @Volans wrote: >>>! 
In T97972#5352851, @Joe wrote: >> IIRC we already have an account specialized for accessi... [11:21:56] (03Merged) 10jenkins-bot: Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527052 (owner: 10Ladsgroup) [11:22:11] (03CR) 10jenkins-bot: Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527052 (owner: 10Ladsgroup) [11:22:57] Amir1, syncing [11:23:06] marostegui: ^ [11:23:07] Thanks [11:23:10] yep, I am ready [11:23:49] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: c164132: Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"" (T225053) (duration: 00m 55s) [11:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:57] T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225053 [11:24:12] Amir1, marostegui: Patch was synced [11:26:42] Amir1: I am starting to see the read_key handler spiking, let's see what it does [11:27:04] (03CR) 10Mobrovac: "We are in the process of splitting RESTBase into two services in production, so I'd advocate for pushing this a bit down the line and add " [puppet] - 10https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [11:31:30] (03PS1) 10Urbanecm: Add nlm.nih.gov to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527082 (https://phabricator.wikimedia.org/T229470) [11:32:01] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527082 (https://phabricator.wikimedia.org/T229470) (owner: 10Urbanecm) [11:33:09] (03Merged) 10jenkins-bot: Add nlm.nih.gov to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527082 
(https://phabricator.wikimedia.org/T229470) (owner: 10Urbanecm) [11:33:25] (03CR) 10jenkins-bot: Add nlm.nih.gov to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527082 (https://phabricator.wikimedia.org/T229470) (owner: 10Urbanecm) [11:34:57] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 1e4458e: Add nlm.nih.gov to the wgCopyUploadsDomains whitelist for commonswiki (T229470) (duration: 00m 53s) [11:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:05] T229470: Add nlm.nih.gov to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T229470 [11:36:59] (03PS1) 10Urbanecm: Add files.geocollections.info to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527084 (https://phabricator.wikimedia.org/T229547) [11:37:43] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527084 (https://phabricator.wikimedia.org/T229547) (owner: 10Urbanecm) [11:38:31] Amir1: It keeps increasing, but it is not a bad thing, that handler means the reads are being done from an index [11:38:41] (03Merged) 10jenkins-bot: Add files.geocollections.info to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527084 (https://phabricator.wikimedia.org/T229547) (owner: 10Urbanecm) [11:38:58] (03CR) 10jenkins-bot: Add files.geocollections.info to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527084 (https://phabricator.wikimedia.org/T229547) (owner: 10Urbanecm) [11:39:25] marostegui: yeah. Let's see when it stops. 
Also, there are plans to improve it but it might take a week or two to get it there [11:39:48] Amir1: the query latency remains the same, so there is not a degradation there [11:39:51] the traffic has increased [11:40:00] but not the amount of queries or the processes [11:40:11] so we are just reading more, but so far, fast enough [11:40:49] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: c51baa3: Add files.geocollections.info to the wgCopyUploadsDomains whitelist for commonswiki (T229547) (duration: 00m 55s) [11:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:57] T229547: Add files.geocollections.info to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T229547 [11:41:08] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27460 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1017&var-datasource=eqiad+prometheus/ops [11:41:38] the traffic should not increase that much. [11:41:48] (03CR) 10Krinkle: Use GTIDs for master position queries for external DB when possible (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz) [11:42:18] !log EU SWAT done [11:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:26] Amir1: would you expect this change on queries?
https://grafana.wikimedia.org/d/000000273/mysql?panelId=31&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&refresh=10s&from=now-3h&to=now [11:43:20] I don't think so [11:43:23] I can double check [11:44:01] I think we are having contention now https://grafana.wikimedia.org/d/000000273/mysql?panelId=23&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&refresh=10s&from=now-3h&to=now [11:44:59] The read_key handler is now decreasing, let's give it some more time, it might be getting everything in memory [11:45:59] Amir1: the traffic has basically shifted back to previous values [11:46:30] Amir1: Is everything ok, queries are going down [11:46:32] ? [11:46:35] (03PS6) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [11:47:00] yes [11:47:03] it looks good [11:47:19] ok, so yesterday's theory about the buffer pool seems correct [11:47:24] It was probably putting things into the program cache [11:47:34] *application cache [11:47:49] key handler also decreasing [11:47:54] (03PS1) 10Jbond: labpuppetmaster: add missing hiera key [puppet] - 10https://gerrit.wikimedia.org/r/527085 (https://phabricator.wikimedia.org/T229571) [11:48:37] (03CR) 10Jbond: [C: 03+2] labpuppetmaster: add missing hiera key [puppet] - 10https://gerrit.wikimedia.org/r/527085 (https://phabricator.wikimedia.org/T229571) (owner: 10Jbond) [11:50:42] (03CR) 10Urbanecm: "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg) [11:53:20] (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527086 (https://phabricator.wikimedia.org/T228657) [11:53:50] (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527086 
(https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [11:53:58] RECOVERY - Disk space on elastic1017 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1017&var-datasource=eqiad+prometheus/ops [11:54:59] Amir1: I think we are kinda back to previous values now [11:55:08] read_key handler still higher, but that's not bad [11:55:36] coooooooooooooooool [11:55:42] we did have some errors: https://grafana.wikimedia.org/d/000000273/mysql?panelId=10&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&refresh=10s&from=now-3h&to=now [11:55:55] Now we can move to items and finally kill wb_terms table [11:56:02] <3 [11:56:16] it's fine I guess [11:56:32] yeah, everything else looks similar to previous patterns [11:56:38] and the query latency hasn't changed [11:58:09] (03CR) 10Urbanecm: Fix AddGroups/RemoveGroups for editor/autoreview (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518759 (https://phabricator.wikimedia.org/T226410) (owner: 10Reedy) [12:03:37] (03PS1) 10Ladsgroup: Switch property terms migration to WRITE_NEW on client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527087 (https://phabricator.wikimedia.org/T225053) [12:12:20] 10Operations, 10netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10CDanis) That was scheduled maintenance in Centurylink's ticket 16820717, should be resolved as of about two hours ago.
[12:18:58] !log Rename math table on db1089 (enwiki) - T196055 [12:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:06] T196055: Remove table `math` from the database - https://phabricator.wikimedia.org/T196055 [12:30:50] (03CR) 10CDanis: "> Patch Set 1:" (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/526774 (owner: 10CDanis) [12:31:25] (03PS2) 10CDanis: dbctl: require commit messages [software/conftool] - 10https://gerrit.wikimedia.org/r/526774 [12:36:39] (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527092 (https://phabricator.wikimedia.org/T228657) [12:38:13] !log add cp1008 to canary hosts https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/puppetmaster/frontend.yaml#L22 [12:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:34] (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527092 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [12:42:02] (03PS2) 10Filippo Giunchedi: site: lithium to spare [puppet] - 10https://gerrit.wikimedia.org/r/526980 (https://phabricator.wikimedia.org/T229557) [12:43:42] (03CR) 10Filippo Giunchedi: [C: 03+2] site: lithium to spare [puppet] - 10https://gerrit.wikimedia.org/r/526980 (https://phabricator.wikimedia.org/T229557) (owner: 10Filippo Giunchedi) [12:50:26] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [12:50:32] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:51:18] (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527095 
(https://phabricator.wikimedia.org/T228657) [12:52:07] lithium tls failure is expected, being decom'd [12:52:45] (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527095 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [12:57:33] (03PS1) 10Jbond: puppetm,aster - canary_hosts: allow hosts top be IP addresses as well as fqdn [puppet] - 10https://gerrit.wikimedia.org/r/527096 [12:58:38] (03CR) 10Jbond: [C: 03+2] puppetm,aster - canary_hosts: allow hosts top be IP addresses as well as fqdn [puppet] - 10https://gerrit.wikimedia.org/r/527096 (owner: 10Jbond) [13:12:05] (03PS1) 10Jbond: puppetmaster::frontend: allow canary hosts to be IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/527097 [13:12:48] PROBLEM - puppet last run on alcyone is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:13:27] (03CR) 10Jbond: [C: 03+2] puppetmaster::frontend: allow canary hosts to be IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/527097 (owner: 10Jbond) [13:20:42] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (10Papaul) [13:21:03] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (10Papaul) 05Open→03Resolved Complete [13:28:45] (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527100 (https://phabricator.wikimedia.org/T228657) [13:29:46] (03PS2) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527100 (https://phabricator.wikimedia.org/T228657) [13:30:31] (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527100 
(https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [13:32:48] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) a:05Papaul→03wiki_willy @jijiki I will talking to @wiki_willy to see what are our options on this. @wiki_willy this system is out if warranty since April 2019 and we do have a proble... [13:35:54] 10Operations, 10ops-codfw, 10DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Papaul) I checked IDRAC logs this morning, all looks good so far [13:37:20] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 2 others: Write incident report for jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Pchelolo) 05Open→03Resolved Report written. Please reopen if it's not sufficient. [13:44:49] RECOVERY - puppet last run on alcyone is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:48:45] 10Operations, 10ops-codfw, 10DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Marostegui) Hehe, yeah, I checked too. Let's give it till Monday Cross your fingers! [13:57:24] (03PS1) 10CDanis: dbctl: to 100%! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527104 (https://phabricator.wikimedia.org/T229070) [13:59:59] (03PS2) 10CDanis: dbctl: to 100%! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527104 (https://phabricator.wikimedia.org/T229070) [14:00:04] cdanis: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) dbctl to 100% deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1400). 
[14:03:58] 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Marostegui) I have set up the proxy for m2 in codfw. I kn... [14:08:40] (03CR) 10Giuseppe Lavagetto: [C: 03+1] dbctl: to 100%! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527104 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:09:33] (03CR) 10Ayounsi: [C: 03+2] check_anycast_healthchecker, add sudo bird rights [puppet] - 10https://gerrit.wikimedia.org/r/526791 (owner: 10Ayounsi) [14:12:56] (03PS2) 10Ayounsi: check_anycast_healthchecker, add sudo bird rights [puppet] - 10https://gerrit.wikimedia.org/r/526791 [14:14:12] * cdanis taking over mwdebug2002 for a quick test [14:17:18] (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527109 (https://phabricator.wikimedia.org/T228657) [14:17:54] * cdanis proceeding with rollout [14:18:17] (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527109 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [14:18:27] (03PS2) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527109 (https://phabricator.wikimedia.org/T228657) [14:18:30] (03CR) 10CDanis: [C: 03+2] dbctl: to 100%! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527104 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:19:50] (03Merged) 10jenkins-bot: dbctl: to 100%! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527104 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:20:40] (03CR) 10jenkins-bot: dbctl: to 100%! 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/527104 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:22:03] !log cdanis@deploy1001 Synchronized wmf-config/etcd.php: Iaaa1238 dbctl to 100% of production! (duration: 00m 54s) [14:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:38] I'm going to deploy something for a train blocker [14:22:53] Amir1: ok, my deploy is done [14:23:19] cdanis: thanks and congrats for doing this. It's awesome. I love it [14:23:39] 😊 [14:23:59] (03PS7) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [14:24:01] I ran a test dump of a few revisions on a snapshot host. works fine. :-) [14:24:23] apergos: well, the old db-foo.php configs are still correct, for the moment [14:24:30] (03PS1) 10Marostegui: wmnet: Point m2-master.codfw to dbproxy2002 [dns] - 10https://gerrit.wikimedia.org/r/527114 (https://phabricator.wikimedia.org/T176532) [14:24:34] cdanis: dbctl 100% ?? [14:24:37] yes but we don't use those directly for anything [14:24:38] elukey: 100% [14:24:42] it's all 'ask mediawiki' [14:24:45] Amir1: thanks for deploying that. i was just waiting for cdanis to finish, but i'm happy to let you do the honors :) [14:24:48] apergos: that should be fine [14:24:55] exactly! [14:25:06] \o/ [14:25:14] elukey: now I have some time for ONFIRE things ;) [14:25:22] (and thanks for doing what you did!) [14:25:41] mdholloway: I can do something else if you're on it [14:25:52] * Amir1 looks at his plate full of bugs [14:25:58] thank you for this work, now I can see marostegui partying now [14:26:15] Amir1: yeah, i can take it from here. [14:26:22] and we're live! 
:D [14:26:28] thanks then mdholloway [14:26:36] (03PS2) 10Marostegui: wmnet: Point m2-master.codfw to dbproxy2002 [dns] - 10https://gerrit.wikimedia.org/r/527114 (https://phabricator.wikimedia.org/T176532) [14:27:50] I just realized I've forgotten to sync-file on CommonSettings.php, but my only changes there were to comments, so I won't worry about it for now: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/527104/2/wmf-config/CommonSettings.php [14:27:57] cdanis: you have no idea how useful it is for devs as well. Schema changes always took really long time. I basically stopped doing any schema change development because of the schema change process being so slow and hard for our DBAs and T191231 [14:27:58] T191231: RFC: Abstract schemas and schema changes - https://phabricator.wikimedia.org/T191231 [14:28:33] cdanis: do that too anyway to not leave changed files un-deployed [14:28:34] I'm glad to hear it :) also want to thank _joe_ and volans as well, could not have done it without either of them [14:28:41] I'm pretty sure it's ten times harder for our DBAs [14:30:38] (03CR) 10Paladox: [C: 03+1] "Awesome! Thank you! This will allow gerrit2001 to start!" [dns] - 10https://gerrit.wikimedia.org/r/527114 (https://phabricator.wikimedia.org/T176532) (owner: 10Marostegui) [14:30:53] marostegui: btw. I will roll this change out for client wikis in Monday [14:31:03] (03CR) 10Alaa Sarhan: [C: 03+1] Switch property terms migration to WRITE_NEW on client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527087 (https://phabricator.wikimedia.org/T225053) (owner: 10Ladsgroup) [14:31:16] (03CR) 10Eevans: [C: 03+1] Remove RESTBase graphite alerts. [puppet] - 10https://gerrit.wikimedia.org/r/525856 (https://phabricator.wikimedia.org/T185089) (owner: 10Ppchelko) [14:31:19] Amir1: Not sure I get what that means .) 
[14:31:31] the reads will shift again but given that caches are already warmed up, I don't think it'll cause any noticeable difference [14:32:12] Amir1: What do you mean with wiki clients? [14:32:17] marostegui: client wikis (=all wikis that read from wikidata) also access the term store [14:32:23] Ah right, ok [14:32:24] Sure [14:32:38] it's probably half of the reads [14:33:01] You'll deploy that at the normal SWAT time? [14:33:25] marostegui: yup [14:33:33] Amir1: cool, I'll be around [14:33:39] thanks [14:34:20] uh, shouldn’t they already access the new term store? [14:34:34] I thought the latest repo change was that the old store is no longer written to? [14:35:06] Lucas_WMDE: what do you mean? Can you elaborate more? [14:36:01] I thought the change you made in the repo was from “read new write both” to “read new” [14:36:16] is that wrong? was it only changing from “read old write both” to “read new write both”? [14:36:46] The change was “read old write both” to “read new write both” [14:36:53] ok good [14:36:54] on wikidata, client still reads old [14:36:56] Amir1: have you deployed yet? is it okay if I do another quick comment-only sync-file? [14:36:59] then the clients can still read either [14:37:08] cdanis: mdholloway is doing it [14:37:28] Lucas_WMDE: yes [14:37:34] cdanis: go for it, i'll be waiting on jenkins for a bit [14:37:37] we just didn't deploy it yet [14:37:42] rgr, ty [14:38:29] (03CR) 10jerkins-bot: [V: 04-1] table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [14:38:39] !log cdanis@deploy1001 Synchronized wmf-config/CommonSettings.php: Iaaa1238 comment-only no-op change (dbctl to 100% of production!) 
(duration: 00m 55s) [14:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:17] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:39:34] 10Operations, 10ops-codfw: ms-be2018 sdc unreadable sector - https://phabricator.wikimedia.org/T225630 (10fgiunchedi) 05Open→03Resolved Disk replaced and is rebuilding, thanks @Papaul [14:40:53] mdholloway / Amir1 - thanks again for that. [14:41:03] 10Operations, 10Traffic: fifo-log-tailer: evergrowing memory usage - https://phabricator.wikimedia.org/T229414 (10ema) 05Open→03Resolved The new `fifo-log-tailer` has now been running for one day and shows reasonable memory usage: ` 14:39:53 ema@cp1080.eqiad.wmnet:~ $ ps u -q `pidof fifo-log-tailer` USER... [14:41:22] (03CR) 10Herron: [C: 03+1] "Looks like a workable stopgap. We could probably run it on a 5/10 min interval to recover more quickly if/when rsyslog fails." [puppet] - 10https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [14:41:31] brennen: np! [14:42:01] brennen: I actually caused the issue so I don't think I should be thanked. Sorry for the trouble [14:42:12] Amir1: btw, where's the best place to chat with WMDE folks? 
i joined #wikidata a few days ago, but it looked pretty desolate, so i left [14:42:22] (03PS8) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [14:42:53] i guess this channel works, in any case [14:43:21] #wikimedia-de-tech is the usual channel [14:43:47] ah, thanks [14:44:38] (03CR) 10Ema: "> We are in the process of splitting RESTBase into two services in" [puppet] - 10https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:49:16] (03CR) 10AndyRussG: [C: 04-1] "Hi! Mediawiki config changes should not be +2'd until the time of the deploy window when they'll be deployed. See: https://wikitech.wikime" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg) [14:50:42] (03CR) 10AndyRussG: [C: 04-1] "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg) [14:50:48] 10Operations, 10ops-codfw: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T229283 (10fgiunchedi) 05Open→03Resolved Disk replaced and rebuilding [14:50:51] RECOVERY - Device not healthy -SMART- on ms-be2021 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2021&var-datasource=codfw+prometheus/ops [14:50:58] Lucas_WMDE: that appears to be the IRC channel for today that i didn't know existed but should have. i seem to pick up at least one a week. 
[14:51:17] !log performing rolling restarts of eqiad logstash cluster for security updates [14:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:47] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:53:19] RECOVERY - HP RAID on ms-be2021 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:54:43] (03CR) 10jerkins-bot: [V: 04-1] table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [14:57:29] (03PS1) 10BPirkle: Switch testwiki to read sessions from kask, with fallback to redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527120 (https://phabricator.wikimedia.org/T222099) [15:00:56] (03CR) 10AndyRussG: [C: 04-1] "Any thoughts on T225261? Maybe we could at least partly bring this inline with site-wide policy? Also, when site-wide CSP becomes enforced" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg) [15:02:01] PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:03:15] PROBLEM - SSH on ms-be2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:03:57] RECOVERY - Device not healthy -SMART- on ms-be2018 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2018&var-datasource=codfw+prometheus/ops [15:04:33] RECOVERY - SSH on ms-be2021 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:05:56] OK, looks like nothing's on for Puppet SWAT [15:06:51] herron: should i wait for you to finish before deploying the backports for the train blocker fix, or can that happen in parallel? (or are you already done?) [15:07:19] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 104.65, 120.56, 74.82 https://wikitech.wikimedia.org/wiki/Swift [15:07:29] RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:07:36] mdholloway: no, please carry on. it’s slow online rolling restart with no impact expected [15:07:44] cool, thanks [15:11:17] Amir1: did something happened at around 14:00 UTC? there was a spike, similar to the one when you deployed the change [15:11:26] https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1109&var-port=9104&from=now-6h&to=now&refresh=5s [15:11:55] marostegui: I don't remember deploying anything [15:11:57] RECOVERY - very high load average likely xfs on ms-be2018 is OK: OK - load average: 43.55, 73.68, 66.54 https://wikitech.wikimedia.org/wiki/Swift [15:12:29] Amir1: any possible explanation for that? 
[15:13:04] !log mholloway-shell@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/Wikibase: Do not warn about entity that was not found in WikiPageEntityRevisionLookup (T229482) (duration: 01m 20s) [15:13:06] marostegui: that's too early for the dbctl 100% [15:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:12] T229482: PHP Warning: Wikibase\Lib\Store\Sql\WikiPageEntityRevisionLookup::getEntityRevision: Entity not loaded - https://phabricator.wikimedia.org/T229482 [15:13:18] cdanis: yeah, it started at around 13:58 or so [15:13:21] marostegui: did you check SAL? [15:13:25] Amir1: yep [15:14:11] I don't think I did anything. :/ [15:14:32] Amir1: Then it is very weird, as the spike it is very similar (also in duration) to the one we saw with the first deploy [15:14:54] Amir1: Same type of select even :-/ [15:15:20] it can be that some caches got evicted, specially if it happens again [15:15:28] then we need to do something about it [15:15:45] Amir1: Yeah, definitely, those spikes aren't good if they happen that often, let's keep an eye on it [15:15:53] sure [15:15:57] I am pretty sure it is related, because it is exactly the same pattern on pretty much every graph: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1109&var-port=9104&from=now-6h&to=now&refresh=5s [15:16:26] !log mholloway-shell@deploy1001 Synchronized php-1.34.0-wmf.15/extensions/Wikibase: Do not warn about entity that was not found in WikiPageEntityRevisionLookup (T229482) (duration: 01m 14s) [15:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:08] (03CR) 10Ema: [C: 03+1] fifo_log_demux: Remove socket activation [puppet] - 10https://gerrit.wikimedia.org/r/527075 (owner: 10Vgutierrez) [15:20:17] mdholloway: i should be clear at this point to proceed with wmf.16 -> group1, yeah? [15:21:20] brennen: yep, should be clear [15:21:30] cool, thanks. 
[15:22:12] (03PS4) 10Mforns: analytics::refinery::job::data_purge Migrate webrequest timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) [15:23:42] (03CR) 10Ema: fifo-log-demux: Keep attempting to read the FIFO after EOF (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527013 (owner: 10Vgutierrez) [15:24:21] (03PS5) 10Mforns: analytics::refinery::job::data_purge Migrate webrequest timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) [15:25:07] (03PS9) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [15:27:10] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-codfw.wikimedia.org recovered from Juniper alarm active [15:30:21] (03PS1) 10Brennen Bearnes: group1 wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527128 [15:30:23] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527128 (owner: 10Brennen Bearnes) [15:31:34] 10Operations, 10Traffic, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10BBlack) 05Open→03Declined We had a quick discussion and a small informal vote and decided we don't really need this functionality (pinkunicorn) anymore, so we're going to retire it... 
[15:31:37] 10Operations, 10ops-eqiad, 10decommission, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10BBlack) [15:32:17] (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527129 (https://phabricator.wikimedia.org/T228657) [15:34:28] (03CR) 10Mobrovac: "> Do you have a rough estimate of when "a bit down the line" could" [puppet] - 10https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [15:34:40] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10Jclark-ctr) [15:35:21] (03CR) 10Ppchelko: [C: 03+1] Switch testwiki to read sessions from kask, with fallback to redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527120 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle) [15:36:26] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527128 (owner: 10Brennen Bearnes) [15:39:05] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.16 [15:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:00] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.16 (duration: 00m 54s) [15:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:12] (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527129 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [15:40:20] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 92.86% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:41:38] 
(03PS1) 10Alexandros Kosiaris: restrouter: Switch to event_service_uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/527130 (https://phabricator.wikimedia.org/T223953) [15:44:42] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [15:45:34] PROBLEM - Device not healthy -SMART- on ms-be2018 is CRITICAL: cluster=swift device=cciss,2 instance=ms-be2018:9100 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2018&var-datasource=codfw+prometheus/ops [15:45:46] (03CR) 10Eevans: [C: 03+1] Switch testwiki to read sessions from kask, with fallback to redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527120 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle) [15:47:52] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [15:47:55] !log start codfw mgmt work - T228112 [15:48:02] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527128 (owner: 10Brennen Bearnes) [15:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:03] T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 [15:48:40] (03PS4) 10Filippo Giunchedi: facilities: add model to pdu monitoring [puppet] - 10https://gerrit.wikimedia.org/r/526633 (https://phabricator.wikimedia.org/T148541) [15:50:27] (03CR) 10Alexandros Kosiaris: restrouter: Add helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/526719 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [15:51:41] (03PS2) 10Mforns: 
analytics::refinery::job::data_purge Migrate EL timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519685 (https://phabricator.wikimedia.org/T226862) [15:51:44] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:52:08] 10Operations, 10Analytics, 10Discovery, 10Research-Backlog: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10EBernhardson) [15:52:21] (03CR) 10Ppchelko: [V: 03+2 C: 03+2] restrouter: Switch to event_service_uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/527130 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [15:53:24] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:54:22] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [15:55:52] 10Operations, 10DBA, 10serviceops-radar, 10Performance-Team (Radar): phased rollout of dbctl, etcd-backed database configuration in Mediawiki - https://phabricator.wikimedia.org/T229070 (10Krinkle) >>! In T229070#5367389, @gerritbot wrote: > Change 525684 had a related patch set uploaded (by CDanis; owner:... [15:55:57] 10Operations, 10Traffic, 10netops, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10jbond) Hi all, I just stumbled upon this task while investigating something else. Its something I'm happy to progress however i wanted to consider if the... 
[15:57:02] (03CR) 10Filippo Giunchedi: [C: 03+2] facilities: add model to pdu monitoring [puppet] - 10https://gerrit.wikimedia.org/r/526633 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [15:57:38] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [15:58:10] 04Critical Alert for device cr1-codfw.wikimedia.org - Juniper alarm active [15:59:18] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:59:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 126, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:59:42] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:00:04] godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:00:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10BBlack) [16:01:28] (03PS1) 10BBlack: acamar/achernar: site.pp cleanup post-decom [puppet] - 10https://gerrit.wikimedia.org/r/527133 (https://phabricator.wikimedia.org/T198286) [16:01:29] (03PS1) 10BBlack: eqiad cp decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/527134 (https://phabricator.wikimedia.org/T229586) [16:02:24] (03CR) 10BBlack: [V: 03+2 C: 03+2] acamar/achernar: site.pp cleanup post-decom [puppet] - 10https://gerrit.wikimedia.org/r/527133 (https://phabricator.wikimedia.org/T198286) (owner: 10BBlack) [16:03:11] (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527135 (https://phabricator.wikimedia.org/T228657) [16:04:26] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10wiki_willy) @Papaul - if you can't find a spare from any of those decom servers, we can order it, since it's still a while before the 5yr mark. 
Thanks Willy [16:05:46] !log power down msw1-codfw [16:05:48] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [16:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:04] (03PS2) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527135 (https://phabricator.wikimedia.org/T228657) [16:09:02] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [16:09:29] (03CR) 10Urbanecm: "> Patch Set 2: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg) [16:10:31] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10kzimmerman) a:05Nuria→03None I've approved as manager, so moving back to unassigned for... 
[16:11:11] (03PS1) 10Krinkle: noc: Add cross-dc navigation links to db.php footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527137 [16:11:33] (03CR) 10Jbond: [V: 03+2 C: 03+2] "simple patch CI taking too long" [puppet] - 10https://gerrit.wikimedia.org/r/527135 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [16:12:13] (03PS2) 10Krinkle: noc: Add cross-dc navigation links to db.php footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527137 (https://phabricator.wikimedia.org/T197126) [16:14:49] (03PS2) 10Cwhite: logstash: update statsd exporter mappings and use exporter [puppet] - 10https://gerrit.wikimedia.org/r/526782 (https://phabricator.wikimedia.org/T205870) [16:18:58] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [16:22:12] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [16:24:31] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/526782 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:25:34] (03PS2) 10BBlack: eqiad cp decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/527134 (https://phabricator.wikimedia.org/T229586) [16:32:51] oh, i'm surprised you all could actually hear me in the meeting, i just realized i've got the wrong (mic-less) headphones on [16:33:48] i guess someone would have said something if it was an issue [16:35:10] (03CR) 10Krinkle: Bring up password change logging to the same standards as login logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467110 (owner: 10Gergő Tisza) [16:37:26] (03CR) 10BBlack: [C: 03+2] eqiad cp decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/527134 (https://phabricator.wikimedia.org/T229586) (owner: 10BBlack) [16:40:35] sheesh, wrong channel [16:40:49] (03PS1) 10BBlack: pink unicorn death [dns] - 10https://gerrit.wikimedia.org/r/527145 (https://phabricator.wikimedia.org/T229586) [16:44:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp1008.wikimedia.o... 
[16:45:22] (03CR) 10BBlack: [C: 03+2] pink unicorn death [dns] - 10https://gerrit.wikimedia.org/r/527145 (https://phabricator.wikimedia.org/T229586) (owner: 10BBlack) [16:45:43] 10Operations, 10ops-eqiad, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10Jclark-ctr) [16:46:39] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10Jclark-ctr) [16:51:10] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10greg) a:05greg→03mmodell >>! In T226044#5380942, @greg wrote: >>>! In T226044#5380759... [16:54:14] PROBLEM - puppet last run on aqs1008 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:57:29] 10Operations, 10Puppet, 10Packaging: puppet fails to run in cp1008 under certain conditions - https://phabricator.wikimedia.org/T221343 (10BBlack) 05Open→03Declined Decom in T229586 [16:57:37] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10BBlack) [16:58:06] RECOVERY - Device not healthy -SMART- on ms-be2018 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2018&var-datasource=codfw+prometheus/ops [17:00:05] cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1700). 
[17:01:00] 10Operations, 10Puppet, 10observability: Use git commit id as "configuration version" for puppet - https://phabricator.wikimedia.org/T228854 (10jbond) I think this is a really good idea. further after a bit of investigation i think any arbitrary string can be used. Later versions of the puppet documentati... [17:02:10] 04Critical Alert for device msw1-codfw.mgmt.codfw.wmnet - Juniper alarm active [17:09:36] PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 62.01, 30.65, 21.36 https://wikitech.wikimedia.org/wiki/Application_servers [17:09:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1073.eqiad.wmnet', 'cp1074.eqiad.wmnet', 'cp1072.eqiad.wmnet', 'cp... [17:10:56] (03Abandoned) 10BBlack: ncredir hostname and service IP [dns] - 10https://gerrit.wikimedia.org/r/295249 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack) [17:11:52] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [17:11:55] (03Abandoned) 10BBlack: redirects.dat - split non-canonical to separate section [puppet] - 10https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack) [17:12:12] RECOVERY - High CPU load on API appserver on mw1286 is OK: OK - load average: 23.10, 28.77, 22.31 https://wikitech.wikimedia.org/wiki/Application_servers [17:12:25] (03Abandoned) 10BBlack: [POC] DNS zones to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/342887 (owner: 10BBlack) [17:12:27] (03PS3) 10Elukey: 
analytics::refinery::job::data_purge Migrate EL timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519685 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns) [17:12:29] (03Abandoned) 10BBlack: VCL: grace-within-TTL [puppet] - 10https://gerrit.wikimedia.org/r/364606 (owner: 10BBlack) [17:12:31] (03CR) 10Gergő Tisza: Bring up password change logging to the same standards as login logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467110 (owner: 10Gergő Tisza) [17:12:35] (03CR) 10Dzahn: "boldly removing Giuseppe's -2 because we talked about it in meeting and agreed it's good to go now" [puppet] - 10https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069) (owner: 10Dzahn) [17:13:12] (03Abandoned) 10BBlack: Browser connection security warnings, again [puppet] - 10https://gerrit.wikimedia.org/r/407701 (owner: 10BBlack) [17:13:55] (03Abandoned) 10BBlack: [WIP] Move cache::canary from cp1008 to cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/451326 (owner: 10BBlack) [17:14:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [17:14:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10BBlack) These are ready to go for dcops-level work! 
[17:14:42] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:17:10] 04̶C̶r̶i̶t̶i̶c̶a̶l Device msw1-codfw.mgmt.codfw.wmnet recovered from Juniper alarm active [17:19:24] (03PS1) 10BBlack: Remove cache::canary stuff [puppet] - 10https://gerrit.wikimedia.org/r/527157 [17:20:08] RECOVERY - puppet last run on aqs1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:21:28] (03Abandoned) 10BBlack: XXX note bad entries for conf200x in network::constants [puppet] - 10https://gerrit.wikimedia.org/r/465455 (owner: 10BBlack) [17:23:34] (03Abandoned) 10BBlack: CI check [dns] - 10https://gerrit.wikimedia.org/r/483198 (owner: 10BBlack) [17:24:30] (03Abandoned) 10BBlack: discovery-map remove [1/4]: remove refs [puppet] - 10https://gerrit.wikimedia.org/r/522110 (owner: 10BBlack) [17:24:40] (03Abandoned) 10BBlack: discovery-map remove [2/4]: remove ops/dns refs [dns] - 10https://gerrit.wikimedia.org/r/522113 (owner: 10BBlack) [17:24:51] (03Abandoned) 10BBlack: discovery-map remove [3/4]: stop deploying [puppet] - 10https://gerrit.wikimedia.org/r/522111 (owner: 10BBlack) [17:24:59] (03Abandoned) 10BBlack: discovery-map remove [4/4]: Remove completely [puppet] - 10https://gerrit.wikimedia.org/r/522112 (owner: 10BBlack) [17:25:32] PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:26:29] (03Abandoned) 10BBlack: Block POSTs to some wiki URLs [puppet] - 10https://gerrit.wikimedia.org/r/240389 (owner: 10Coren) [17:28:11] 04Critical Alert for device cr1-codfw.wikimedia.org - Juniper alarm active got acknowledged [17:28:39] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::data_purge Migrate EL timers to new script [puppet] - 
10https://gerrit.wikimedia.org/r/519685 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns) [17:28:46] 10Operations, 10Analytics, 10SRE-Access-Requests: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (10Mayakp.wiki) Thanks @Nuria for the query and suggestion. I will use Jupyter and Beeline in the meantime. Please let me know whenever my HUE access is granted. https://wikitech.wikimed... [17:29:19] (03CR) 10BBlack: "Compiler looks sane: https://puppet-compiler.wmflabs.org/compiler1001/17697/" [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson) [17:29:32] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:29:32] (03PS10) 10BBlack: LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson) [17:29:41] (03PS6) 10Elukey: analytics::refinery::job::data_purge Migrate webrequest timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns) [17:31:29] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::data_purge Migrate webrequest timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns) [17:32:16] RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:36:08] !log running db dump on phab1003 (in tmux). command: sudo ./bin/storage dump --output /srv/dumps/phabricator_db_20190801.sql.gz --compress [17:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:10] I'm almost curious enough about what that does to go look at the code...
but not quite [17:41:15] (going on 9 pm) [17:42:44] !log disable puppet on lvs1014 + lvs1016 for cloudelastic LVS merge - T224324 [17:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:53] T224324: LB for cloudelastic - https://phabricator.wikimedia.org/T224324 [17:43:24] (03PS11) 10BBlack: LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson) [17:44:29] (03CR) 10BBlack: [C: 03+2] LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson) [17:49:01] (03PS1) 10BBlack: cloudelastic hieradata: fix parens mismatch typo [puppet] - 10https://gerrit.wikimedia.org/r/527165 [17:49:19] (03CR) 10BBlack: [V: 03+2 C: 03+2] cloudelastic hieradata: fix parens mismatch typo [puppet] - 10https://gerrit.wikimedia.org/r/527165 (owner: 10BBlack) [17:50:08] PROBLEM - Check systemd state on cloudelastic1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:12] PROBLEM - Check systemd state on cloudelastic1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:22] PROBLEM - Check systemd state on cloudelastic1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:46] RECOVERY - Check systemd state on cloudelastic1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:46] RECOVERY - Check systemd state on cloudelastic1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:47] ^ me, working on it [17:51:58] RECOVERY - Check systemd state on cloudelastic1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:00] PROBLEM - puppet last run on mw1270 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:58:51] (03PS3) 10Ejegg: CSP for banner preview: allow remind me later host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T225261) [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1800). [18:00:04] bpirkle: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:13] (03CR) 10Ejegg: "Thanks AndyRussG! I've invoked the powers of ctrl-c ctrl-v to bring this preview CSP more in line with the existing CSP as you suggest." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T225261) (owner: 10Ejegg) [18:00:14] I'm here [18:00:15] I can SWAT today! 
[18:00:51] (03CR) 10Urbanecm: [C: 03+2] Switch testwiki to read sessions from kask, with fallback to redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527120 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle) [18:02:15] (03Merged) 10jenkins-bot: Switch testwiki to read sessions from kask, with fallback to redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527120 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle) [18:02:17] (03PS1) 10Volans: cumin: remove old scripts converted to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/527170 (https://phabricator.wikimedia.org/T205886) [18:02:24] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [18:02:31] (03CR) 10jenkins-bot: Switch testwiki to read sessions from kask, with fallback to redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527120 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle) [18:02:52] bpirkle, pulled onto mwdebug1002, if it's testable there [18:03:14] PROBLEM - SSH on ms-be2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:04:03] Good to go [18:04:41] syncing bpirkle [18:04:44] RECOVERY - SSH on ms-be2021 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:05:04] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [18:05:44] PROBLEM - HHVM rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [18:06:10] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: 469c42d: Switch testwiki to read sessions from kask, with fallback to redis (T222099) (duration: 00m 55s) [18:06:14] bpirkle, done [18:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:19] T222099: Staging release of RESTBagOStuff using Kask - https://phabricator.wikimedia.org/T222099 [18:06:45] Urbanecm: thank you [18:06:48] yw bpirkle [18:06:51] (03PS1) 10BBlack: Attempt to fix cloudelastic LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/527172 [18:08:48] RECOVERY - HHVM rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 75584 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:09:31] (03CR) 10Volans: "I'll manually delete those files from the cumin hosts." [puppet] - 10https://gerrit.wikimedia.org/r/527170 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [18:09:56] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:11:56] PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:13:06] (03PS1) 10Volans: sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586) [18:13:24] RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:19:46] (03PS1) 10Dzahn: acme_chief: replace cp1008 with cp1099 as authorized host [puppet] - 10https://gerrit.wikimedia.org/r/527175 (https://phabricator.wikimedia.org/T229586) [18:20:31] (03PS2) 10BBlack: Attempt to fix cloudelastic LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/527172 [18:20:35] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [18:21:24] (03CR) 10jerkins-bot: [V: 04-1] Attempt to fix cloudelastic LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/527172 (owner: 10BBlack) [18:23:05] RECOVERY - puppet last run on mw1270 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:23:21] (03PS1) 10Dzahn: varnish wikimedia-backend.vcl: replace cp1008 with cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/527177 (https://phabricator.wikimedia.org/T229586) [18:23:45] (03PS3) 10BBlack: Attempt to fix cloudelastic LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/527172 [18:24:08] (03CR) 10BBlack: "Compiler success! 
https://puppet-compiler.wmflabs.org/compiler1002/17710/cloudelastic1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/527172 (owner: 10BBlack) [18:24:37] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [18:24:59] (03CR) 10BBlack: [C: 03+2] Attempt to fix cloudelastic LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/527172 (owner: 10BBlack) [18:25:03] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [18:27:53] (03CR) 10Krinkle: "This diff is not a simple as I'd expect for adding the third-party domain. Perhaps these changes should be split?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T225261) (owner: 10Ejegg) [18:29:34] mutante: cp1099 is dying too, I'll look at whatever you're doing there in a sec... [18:30:01] !log lvs1016: puppet re-enabled, pybal restarted, cloudelastic deploy - T224324 [18:30:06] bblack: ok, cool. 
no rush, i see you are busy [18:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:12] T224324: LB for cloudelastic - https://phabricator.wikimedia.org/T224324 [18:31:23] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 70 connections established with conf1004.eqiad.wmnet:4001 (min=82) https://wikitech.wikimedia.org/wiki/PyBal [18:32:19] !log bblack@puppetmaster1001 conftool action : set/pooled=yes; selector: name=^cloudelastic.* [18:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:40] (03PS1) 10Dzahn: puppetmaster::frontend: remove cp1008 as a canary host [puppet] - 10https://gerrit.wikimedia.org/r/527180 (https://phabricator.wikimedia.org/T229586) [18:32:47] RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:44] (03PS8) 10Dzahn: parsoid::testing: add mediawiki appserver profiles to role [puppet] - 10https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069) [18:34:57] PROBLEM - LVS HTTP IPv4 on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.84 and port 8643: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:35:11] paged [18:35:12] ^ ignore that [18:35:16] ignored [18:35:17] known [18:35:18] ack, ignoring [18:35:19] new service just defined, not in use [18:35:44] paged [18:35:48] ok [18:36:22] ACKNOWLEDGEMENT - LVS HTTP IPv4 on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.84 and port 8643: Connection refused CDanis bblack new service just defined, not in use https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:36:26] doing something I think we should be in the habit of doing 🙃 [18:36:29] ACKNOWLEDGEMENT - LVS HTTP IPv4 on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.84 and port 8643: Connection refused 
Brandon Black issues bringing up a new service, non-critical for now! https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:36:29] ACKNOWLEDGEMENT - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb6_8443: Servers cloudelastic1003.wikimedia.org, cloudelastic1001.wikimedia.org are marked down but pooled: cloudelasticlb6_9643: Servers cloudelastic1003.wikimedia.org, cloudelastic1001.wikimedia.org are marked down but pooled: cloudelasticlb6_9243: Servers cloudelastic1003.wikimedia.org, cloudelastic1001.wikimedia.org [18:36:29] ut pooled: cloudelasticlb6_8243: Servers cloudelastic1003.wikimedia.org, cloudelastic1001.wikimedia.org are marked down but pooled: cloudelasticlb6_9443: Servers cloudelastic1003.wikimedia.org, cloudelastic1001.wikimedia.org are marked down but pooled: cloudelasticlb6_8643: Servers cloudelastic1003.wikimedia.org, cloudelastic1001.wikimedia.org are marked down but pooled Brandon Black issues bringing up a new service, non-critical [18:36:29] /wikitech.wikimedia.org/wiki/PyBal [18:36:38] all related [18:36:50] uh hu [18:36:51] h [18:36:55] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 82 connections established with conf1004.eqiad.wmnet:4001 (min=82) https://wikitech.wikimedia.org/wiki/PyBal [18:37:19] cdanis: right! better getting a second SMS that says it's ACKed if you are not near laptop [18:37:37] (in the past the ACK did not create one but nowadays it does) [18:38:49] mutante: I learned when I was updating https://wikitech.wikimedia.org/wiki/Incident_response for other reasons that ACKing has been policy for some time as well [18:39:00] just something many of us never remember to do [18:40:38] cdanis: i agree very much. 
have been pushing for it [18:40:43] PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:40:57] (03PS4) 10Ejegg: CSP for banner preview: allow remind me later host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) [18:41:00] (03PS1) 10Ejegg: Make banner-preview CSP match normal CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527183 (https://phabricator.wikimedia.org/T225261) [18:41:15] then you can also look at the "unhandled problems" in browser [18:41:24] I'm also looking forward to go.dog's https://gerrit.wikimedia.org/r/c/operations/puppet/+/525536 being merged [18:41:27] and the ones still there are meaningful [18:41:29] and will add an IRC highlight word when it does [18:42:51] ah, yea, that's nice [18:43:47] RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:44:35] PROBLEM - configured eth on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:44:37] PROBLEM - Disk space on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops [18:44:37] PROBLEM - MD RAID on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:44:43] PROBLEM - Check size of conntrack table on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:44:57] PROBLEM - Check systemd state on stat1004 is CRITICAL: connect to 
address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:03] PROBLEM - Check whether ferm is active by checking the default input chain on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:45:03] ^ OOM because some script.. which killed NRPE server.. [18:45:05] PROBLEM - dhclient process on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:45:09] unfortunately common [18:45:16] and so noisy [18:45:19] PROBLEM - DPKG on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:45:37] PROBLEM - puppet last run on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:45:57] yep, that was exactly it again [18:46:13] RECOVERY - configured eth on stat1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:46:15] RECOVERY - Disk space on stat1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops [18:46:15] RECOVERY - MD RAID on stat1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:46:19] RECOVERY - Check size of conntrack table on stat1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:46:31] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:39] RECOVERY - Check 
whether ferm is active by checking the default input chain on stat1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:46:39] RECOVERY - dhclient process on stat1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:46:52] (03CR) 10Ejegg: "Good call Krinkle. Now split." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg) [18:46:55] RECOVERY - DPKG on stat1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:47:01] !log stat1004 - starting nagios-nrpe-server which got killed again - jbd2/md0-8 invoked oom-killer [18:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:55] (03CR) 10Krinkle: [C: 03+1] CSP for banner preview: allow remind me later host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg) [18:48:50] (03CR) 10Dzahn: [C: 03+2] parsoid::testing: add mediawiki appserver profiles to role [puppet] - 10https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069) (owner: 10Dzahn) [18:51:11] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:52:16] !log scandium (parsoid testing) - added mw application server roles - puppet work / maintenance [18:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - American version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1900). 
[19:08:12] (03PS1) 10Brennen Bearnes: all wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527192 [19:08:14] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527192 (owner: 10Brennen Bearnes) [19:09:21] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527192 (owner: 10Brennen Bearnes) [19:09:41] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527192 (owner: 10Brennen Bearnes) [19:12:34] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.16 [19:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:35] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:20:38] !log rolling back to wmf.15 on group1 and group2 while we investigate T229575 [19:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:47] T229575: phabricator server 500 error - https://phabricator.wikimedia.org/T229575 [19:25:21] PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:25:32] (03PS1) 10Brennen Bearnes: Group1 and Group2 to php-1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527196 [19:25:34] (03CR) 10Brennen Bearnes: [C: 03+2] Group1 and Group2 to php-1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527196 (owner: 10Brennen Bearnes) [19:26:04] (03PS1) 10BBlack: cloudelastic: add mapped ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/527197 [19:26:37] (03CR) 10BBlack: [C: 03+2] cloudelastic: add mapped ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/527197 (owner: 10BBlack) [19:26:41] (03Merged) 10jenkins-bot: Group1 and
Group2 to php-1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527196 (owner: 10Brennen Bearnes) [19:26:57] (03CR) 10jenkins-bot: Group1 and Group2 to php-1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527196 (owner: 10Brennen Bearnes) [19:30:01] RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:30:03] (03CR) 10Volans: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/526774 (owner: 10CDanis) [19:31:36] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 and group2 to 1.34.0-wmf.15 [19:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:00] !log lvs1014 - puppetize and restart pybal for cloudelastic LVS - T224324 [19:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:08] T224324: LB for cloudelastic - https://phabricator.wikimedia.org/T224324 [19:36:13] yeah one of the icinga checks is still borked [19:36:20] what a nightmare LVS service config is! [19:37:17] bblack: I feel like it shouldn't be so bad :( [19:37:24] of course :) [19:37:59] 10Operations, 10ops-eqiad, 10netops: asw2-c-eqiad:xe-2/0/45 inbound interface errors - https://phabricator.wikimedia.org/T229612 (10ayounsi) p:05Triage→03Normal [19:38:00] it was poorly-factored way back when, but at least it was relatively tame and easy to understand (IMHO) [19:38:37] but in the years since, it's been abused and neglected I think during a bunch of attempts at refactoring it "better" and only getting halfway there, and suffered at the hands of various meta-changes to style standards that don't suit it well, too. 
[19:38:46] it's a complete mess right now [19:39:00] (03PS1) 10Brennen Bearnes: Reverting php symlink for revert of group1 and group2 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527200 [19:39:02] (03CR) 10Brennen Bearnes: [C: 03+2] Reverting php symlink for revert of group1 and group2 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527200 (owner: 10Brennen Bearnes) [19:39:15] !log finished phabricator database dump [19:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:28] I have a lot of history with it, but haven't configured a complex new service in a while, and even I can't make any sense of a right way to do things that works for all known cases :P [19:39:39] heh [19:40:04] !log brennen@deploy1001 Synchronized php: Revert group1 and group2 back to 1.34.0-wmf.15 (duration: 00m 53s) [19:40:07] PROBLEM - PyBal IPVS diff check on lvs1014 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.241:9443, 2620:0:861:1:208:80:154:241:8243, 2620:0:861:1:208:80:154:241:9643, 208.80.154.241:8643, 208.80.154.241:9243, 2620:0:861:1:208:80:154:241:8443, 2620:0:861:1:208:80:154:241:9243, 2620:0:861:1:208:80:154:241:8643, 208.80.154.241:8243, 208.80.154.241:8443, 208.80.154.241:9643, 2620:0:861:1:208:80:154:2 [19:40:07] /wikitech.wikimedia.org/wiki/PyBal [19:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:11] (03Merged) 10jenkins-bot: Reverting php symlink for revert of group1 and group2 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527200 (owner: 10Brennen Bearnes) [19:40:28] that error will clear itself shortly [19:40:32] (03CR) 10jenkins-bot: Reverting php symlink for revert of group1 and group2 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527200 (owner: 10Brennen Bearnes) [19:40:57] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is 
operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:33] this is the remaining issue now, and I suspect it's deep [19:41:37] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cloudelastic.wikimedia.org [19:42:07] note the check should be checking cloudelastic.wikimedia.org, but the IP it's complaining about is actually icinga1001's [19:42:33] I think something broke a while back with various auto-configured HTTPS checks for services, and in general many of them are now polling icinga itself instead of the intended target [19:42:43] but it's only "obvious" when you poll a port that icinga itself doesn't listen on :P [19:43:01] I fixed up something similar for a single case last week I think, but it didn't dawn on me that it could be widespread until now [19:44:23] uhm that's scary if true [19:45:38] yup [19:45:41] RECOVERY - PyBal IPVS diff check on lvs1014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:46:40] 10Operations, 10Traffic, 10netops, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10ayounsi) This should probably wait on T219908. Whatever solution we find to configure IPv4 based on Netbox data, IPv6 should be the same. [19:46:51] hmmm maybe it's not widespread [19:47:02] I only see this example when looking at icinga1001's deployed config [19:53:17] (03PS1) 10BBlack: cloudelastic LVS: avoid "lb4" suffix [puppet] - 10https://gerrit.wikimedia.org/r/527204 [19:53:54] (03CR) 10BBlack: [C: 03+2] cloudelastic LVS: avoid "lb4" suffix [puppet] - 10https://gerrit.wikimedia.org/r/527204 (owner: 10BBlack) [19:57:19] 10Operations, 10ops-codfw, 10netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (10ayounsi) 05Open→03Resolved This is done. 
[19:57:21] 10Operations, 10ops-codfw, 10netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10ayounsi) [19:57:31] !log lvs1016 - restart pybal for slight LVS config change for cloudelastic - T224324 [19:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:41] T224324: LB for cloudelastic - https://phabricator.wikimedia.org/T224324 [19:57:46] 10Operations, 10ops-codfw, 10netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10ayounsi) [19:58:57] PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:00:33] RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:03:05] (03PS2) 10BBlack: Remove cache::canary stuff [puppet] - 10https://gerrit.wikimedia.org/r/527157 [20:04:01] 10Operations, 10netops, 10observability: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824 (10ayounsi) Service Request ID 2019-0801-0611 has been created. [20:06:56] (03CR) 10BBlack: [C: 03+2] Remove cache::canary stuff [puppet] - 10https://gerrit.wikimedia.org/r/527157 (owner: 10BBlack) [20:07:32] (03PS1) 10BBlack: more cp1008 cleanup around puppet [puppet] - 10https://gerrit.wikimedia.org/r/527209 [20:09:23] cmon jerkins... 
[20:11:19] (03PS2) 10BBlack: more cp1008 cleanup around puppet [puppet] - 10https://gerrit.wikimedia.org/r/527209 [20:11:37] (03CR) 10BBlack: [V: 03+2 C: 03+2] more cp1008 cleanup around puppet [puppet] - 10https://gerrit.wikimedia.org/r/527209 (owner: 10BBlack) [20:11:40] whatever jerkins [20:17:56] :( [20:18:06] bblack: give us a few more servers, plz [20:20:57] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active, AS1299/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:21:44] (03CR) 10Cwhite: [C: 03+2] logstash: update statsd exporter mappings and use exporter [puppet] - 10https://gerrit.wikimedia.org/r/526782 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [20:21:46] that's probably not that hard! [20:21:51] (03PS3) 10Cwhite: logstash: update statsd exporter mappings and use exporter [puppet] - 10https://gerrit.wikimedia.org/r/526782 (https://phabricator.wikimedia.org/T205870) [20:26:51] greg-g: didn't we talk about that before annual planning? I thought we did. :/ [20:29:04] !log restart pybal on lvs1014 [20:29:09] we did, but then we also have a commitment from SRE for a k8s cluster for CI, so we're OK, we just need to get to a point where we can move to it, technologically (aka, ditch zuul2) [20:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:11] RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:05] the lvs -> icinga monitoring thing makes no sense :P [20:38:40] the part that it pages immediately and you can't add a new service without causing that? [20:41:19] * Krinkle is going to deployhttps://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/527208/ [20:41:32] There is no application set to open the URL deployhttps://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/527208/. [20:42:41] Try 'Human.app' ?
;-) [20:44:18] mutante: no, the part where the monitoring definition comes out completely wrong [20:44:34] bblack: ugh :/ [20:44:43] (03PS1) 10BBlack: cloudelastic LVS: try a different icinga check with explicit hostname [puppet] - 10https://gerrit.wikimedia.org/r/527214 [20:46:19] !log puppetmaster: create mcrouter certs for scandium.eqiad.wmnet needed to make it an appserver (https://wikitech.wikimedia.org/wiki/Mcrouter#Generate_certs_for_a_new_host) (T228069) [20:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:29] T228069: Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069 [20:47:42] (03CR) 10BBlack: [C: 03+2] cloudelastic LVS: try a different icinga check with explicit hostname [puppet] - 10https://gerrit.wikimedia.org/r/527214 (owner: 10BBlack) [20:47:51] !log scandium - turning into an mw appserver [20:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:25] and also the part where if you define 6 services against one service hostname, you only get 1/6 checks defined :P [20:50:09] (because the horrible yaml-parsing ERB hack for it stores them in a map by hostname keys) [20:50:42] i see :( [20:51:05] but hey, at least the 1 check it made actually polls icinga1001 instead of the intended service :P [20:51:49] haha, oh man. must have broken during some refactoring i guess [20:52:03] yup [20:52:31] I think a whole lot of broken refactoring has happened to all LVS-related things (not that it was awesome before all of that, either) [20:53:10] (03CR) 10Dzahn: "hmm.. it installled Notice: /Stage[main]/Packages::Hhvm_dbg/Package[hhvm-dbg]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069) (owner: 10Dzahn) [20:54:44] Krinkle: is my reading correct that once that is deployed, train ought to be clear to proceed? 
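[editor's note] The 20:49:25 and 20:50:09 messages above describe six services defined against one service hostname producing only one Icinga check, because the checks are stored in a map keyed by hostname. The actual ERB template is not shown in this log, so the following is only a minimal sketch of that failure mode. The port numbers are taken from the alerts earlier in the log; the data layout, the service names, and both function names are hypothetical.

```python
# Hypothetical service definitions: six ports behind one service hostname.
# Ports come from the alerts in this log; everything else is assumed.
SERVICES = [
    {"name": f"cloudelastic_{port}", "host": "cloudelastic.wikimedia.org", "port": port}
    for port in (8243, 8443, 8643, 9243, 9443, 9643)
]


def checks_keyed_by_host(services):
    """Failure mode: each service overwrites the previous one with the same host."""
    checks = {}
    for svc in services:
        checks[svc["host"]] = svc
    return checks


def checks_keyed_by_host_and_port(services):
    """One possible remedy: include the port in the key so no entry is overwritten."""
    return {(svc["host"], svc["port"]): svc for svc in services}


print(len(checks_keyed_by_host(SERVICES)))           # 1
print(len(checks_keyed_by_host_and_port(SERVICES)))  # 6
```

Keying on the (host, port) pair is only one way to keep the entries distinct; keying on the service name would work equally well in this sketch.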
[20:54:50] * Krinkle staging on mwdebug1002 [20:54:58] brennen: yep, certainly worth trying. [20:55:23] What could possibly break? ;-) [20:56:43] i'm sure we'll find out in due time. [20:57:20] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.16/includes/Revision/RevisionRenderer.php: T229589 - 3f1b32e4db3698b8 (duration: 00m 50s) [20:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:29] T229589: PHP Notice: Undefined property: MediaWiki\Revision\RevisionRenderer::$wikiId - https://phabricator.wikimedia.org/T229589 [20:57:32] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (10BBlack) [20:57:48] the ultimate fix for anything: give up and file a ticket and hope someone else fixes it :P [21:00:12] ;) [21:00:46] (03PS1) 10Cwhite: hiera: fix statsd rules [puppet] - 10https://gerrit.wikimedia.org/r/527221 (https://phabricator.wikimedia.org/T205870) [21:01:59] (03CR) 10Cwhite: [C: 03+2] hiera: fix statsd rules [puppet] - 10https://gerrit.wikimedia.org/r/527221 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [21:02:08] (03PS2) 10Cwhite: hiera: fix statsd rules [puppet] - 10https://gerrit.wikimedia.org/r/527221 (https://phabricator.wikimedia.org/T205870) [21:03:48] (03Abandoned) 10Dzahn: varnish wikimedia-backend.vcl: replace cp1008 with cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/527177 (https://phabricator.wikimedia.org/T229586) (owner: 10Dzahn) [21:03:59] (03Abandoned) 10Dzahn: puppetmaster::frontend: remove cp1008 as a canary host [puppet] - 10https://gerrit.wikimedia.org/r/527180 (https://phabricator.wikimedia.org/T229586) (owner: 10Dzahn) [21:04:10] (03Abandoned) 10Dzahn: acme_chief: replace cp1008 with cp1099 as authorized host [puppet] - 10https://gerrit.wikimedia.org/r/527175 (https://phabricator.wikimedia.org/T229586) (owner: 10Dzahn) [21:13:09] (03PS1) 10Brennen Bearnes: 
Group1 and Group2 to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527223 [21:13:11] (03CR) 10Brennen Bearnes: [C: 03+2] Group1 and Group2 to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527223 (owner: 10Brennen Bearnes) [21:14:47] (03Merged) 10jenkins-bot: Group1 and Group2 to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527223 (owner: 10Brennen Bearnes) [21:14:49] (03CR) 10jenkins-bot: Group1 and Group2 to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527223 (owner: 10Brennen Bearnes) [21:14:53] (03CR) 10Dzahn: [C: 03+1] "thanks !:):)" [dns] - 10https://gerrit.wikimedia.org/r/527114 (https://phabricator.wikimedia.org/T176532) (owner: 10Marostegui) [21:16:58] (03PS1) 10Brennen Bearnes: php symlink to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527224 [21:17:00] (03CR) 10Brennen Bearnes: [C: 03+2] php symlink to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527224 (owner: 10Brennen Bearnes) [21:17:40] bblack: Regarding "the ultimate fix for anything: give up and file a ticket and hope someone else fixes it :P". 
Wait until people start assigning tickets to you :D [21:19:06] (03Merged) 10jenkins-bot: php symlink to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527224 (owner: 10Brennen Bearnes) [21:19:21] (03CR) 10jenkins-bot: php symlink to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527224 (owner: 10Brennen Bearnes) [21:22:29] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 and group2 to 1.34.0-wmf.16 [21:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:53] !log brennen@deploy1001 Synchronized php: group1 and group2 to 1.34.0-wmf.16 (duration: 00m 46s) [21:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:46] (03CR) 10Cwhite: [C: 03+1] toil: add rsyslog TLS remedy [puppet] - 10https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [21:34:24] PROBLEM - SSH on ms-be2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:36:38] (03PS1) 10Dzahn: parsoid::testing: fix Hiera key to NOT install hhvm [puppet] - 10https://gerrit.wikimedia.org/r/527226 (https://phabricator.wikimedia.org/T228069) [21:37:24] RECOVERY - SSH on ms-be2021 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:40:21] PROBLEM - HHVM rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:41:36] RECOVERY - HHVM rendering on mw1343 is OK: HTTP OK: HTTP/1.1 200 OK - 75698 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:46:05] (03CR) 10Dzahn: [C: 03+2] parsoid::testing: fix Hiera key to NOT install hhvm [puppet] - 10https://gerrit.wikimedia.org/r/527226 (https://phabricator.wikimedia.org/T228069) (owner: 10Dzahn) [21:46:13] (03PS2) 10Dzahn: parsoid::testing: 
fix Hiera key to NOT install hhvm [puppet] - 10https://gerrit.wikimedia.org/r/527226 (https://phabricator.wikimedia.org/T228069) [21:48:32] !log scandium - apt-get remove --purge hhvm* (T228069) [21:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:41] T228069: Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069 [21:51:27] i also have an issue with LVS, heh. i use "has_lvs: false" in Hiera for this host, scandium but on each puppet run i see it changing /etc/default/wikimedia-lvs-realserver content [21:51:40] and what it changes is ..it removes the LVS_SERVICE_IPS="" [21:51:47] then next run it does it again [21:53:35] (03CR) 10Aaron Schulz: [C: 03+1] noc: Add cross-dc navigation links to db.php footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527137 (https://phabricator.wikimedia.org/T197126) (owner: 10Krinkle) [21:55:28] (03PS1) 10DannyS712: Add importing to english wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) [21:58:11] (03PS2) 10DannyS712: Add importing to english wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) [21:59:47] (03CR) 10jerkins-bot: [V: 04-1] Add importing to english wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) (owner: 10DannyS712) [22:00:34] (03PS3) 10DannyS712: Add importing to english wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) [22:05:39] group2 deploy showed a problem with an eventlogging schema, shipping a schema version bump [22:05:42] brennen: ^ [22:08:13] ack. [22:08:16] action required at this point? [22:08:28] ^ ebernhardson [22:09:31] brennen: i'm deploying, just waiting on jenkins [22:10:13] cool, ty. 
[22:13:21] !log scandium apt-get remove --purge wikimedia-lvs-realserver (T228069) [22:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:30] T228069: Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069 [22:13:40] !log scandium apt-get autoremove [22:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:01] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) (owner: 10DannyS712) [22:17:42] !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/WikimediaEvents/extension.json: T229614: Update eventlogging schema version to resolve eventlogging errors in wmf.16 (duration: 00m 47s) [22:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:50] T229614: tons of errors on eventlogging events - https://phabricator.wikimedia.org/T229614 [22:32:54] (03PS1) 10Dzahn: scandium: add has_lvs on node level in hiera [puppet] - 10https://gerrit.wikimedia.org/r/527232 [22:47:34] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@5ebf93e]: Update mobileapps to 2ee48ab [22:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:58] (03PS1) 10CDanis: Revert "dbctl: diff PHP vs dbctl configs" [puppet] - 10https://gerrit.wikimedia.org/r/527245 (https://phabricator.wikimedia.org/T229070) [22:52:08] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@5ebf93e]: Update mobileapps to 2ee48ab (duration: 04m 34s) [22:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:48] (03PS2) 10Dzahn: scandium: add has_lvs on node level in hiera [puppet] - 10https://gerrit.wikimedia.org/r/527232 [23:00:04] MaxSem, RoanKattouw, and Niharika: It is that lovely time of the day again! 
You are hereby commanded to deploy Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:02:51] (03PS3) 10Dzahn: parsoid::testing: temp. comment out php-restarts include [puppet] - 10https://gerrit.wikimedia.org/r/527232 (https://phabricator.wikimedia.org/T228069) [23:08:45] (03PS4) 10Dzahn: parsoid::testing: temp. comment out php-restarts include [puppet] - 10https://gerrit.wikimedia.org/r/527232 (https://phabricator.wikimedia.org/T228069) [23:08:54] (03PS1) 10Dzahn: scandium: move Hiera key to disable systemd monitor to role level [puppet] - 10https://gerrit.wikimedia.org/r/527253 [23:09:07] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17713/scandium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/527232 (https://phabricator.wikimedia.org/T228069) (owner: 10Dzahn) [23:10:23] (03PS5) 10Dzahn: parsoid::testing: temp. comment out php-restarts include [puppet] - 10https://gerrit.wikimedia.org/r/527232 (https://phabricator.wikimedia.org/T228069) [23:10:28] !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: T229614: Pass proper types to eventlogging to resolve eventlogging errors in wmf.16 (duration: 00m 47s) [23:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:37] T229614: tons of errors on eventlogging events - https://phabricator.wikimedia.org/T229614 [23:10:40] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:16:29] (03PS1) 10DannyS712: Add `autopatroller` group to az wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371) [23:16:34] (03PS1) 10Bstorm: sssd: Add some new 
images to test sssd in containers [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527258 (https://phabricator.wikimedia.org/T229058) [23:18:06] * Urbanecm is going to deploy a few things [23:18:39] (03CR) 10Urbanecm: [C: 03+2] Add importing to english wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) (owner: 10DannyS712) [23:19:01] (03PS2) 10DannyS712: Add `autopatrolled` group to az wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371) [23:19:45] (03Merged) 10jenkins-bot: Add importing to english wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) (owner: 10DannyS712) [23:19:55] (03CR) 10Volans: [C: 03+1] "LGTM, but don't forget to remove the files manually from the affected hosts ;)" [puppet] - 10https://gerrit.wikimedia.org/r/527245 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [23:20:00] (03CR) 10jenkins-bot: Add importing to english wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) (owner: 10DannyS712) [23:22:58] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: cf01272: Add importing to english wikiquote (T228607) (duration: 00m 48s) [23:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:06] T228607: Add importing to en.wq - https://phabricator.wikimedia.org/T228607 [23:26:10] (03PS1) 10Urbanecm: Remove the "autoreview" user group from ru.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527264 (https://phabricator.wikimedia.org/T229596) [23:26:29] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527264 (https://phabricator.wikimedia.org/T229596) (owner: 10Urbanecm) [23:27:28] (03Merged) 10jenkins-bot: Remove the "autoreview" user group from ru.wikipedia 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/527264 (https://phabricator.wikimedia.org/T229596) (owner: 10Urbanecm) [23:27:32] (03CR) 10Urbanecm: [C: 03+2] Add `autopatrolled` group to az wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371) (owner: 10DannyS712) [23:27:43] (03CR) 10jenkins-bot: Remove the "autoreview" user group from ru.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527264 (https://phabricator.wikimedia.org/T229596) (owner: 10Urbanecm) [23:27:53] (03PS3) 10Urbanecm: Add `autopatrolled` group to az wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371) (owner: 10DannyS712) [23:28:10] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371) (owner: 10DannyS712) [23:29:16] (03Merged) 10jenkins-bot: Add `autopatrolled` group to az wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371) (owner: 10DannyS712) [23:29:24] !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: SWAT: 8aca0eb: Remove the "autoreview" user group from ru.wikipedia (T229596) (duration: 00m 47s) [23:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:35] T229596: Remove the "autoreview" user group from ru.wikipedia - https://phabricator.wikimedia.org/T229596 [23:29:37] (03CR) 10jenkins-bot: Add `autopatrolled` group to az wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371) (owner: 10DannyS712) [23:29:39] (03CR) 10CDanis: [C: 03+1] "LGTM but see also https://phabricator.wikimedia.org/T229631" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527137 (https://phabricator.wikimedia.org/T197126) (owner: 10Krinkle) [23:30:32] PROBLEM - Unmerged changes on repository puppet on 
labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[23:30:52] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[23:31:42] PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[23:32:00] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 819073a: Add `autopatrolled` group to az wikisource (T229371) (duration: 00m 49s)
[23:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:10] T229371: Add autopatroller user group to az.wikisource - https://phabricator.wikimedia.org/T229371
[23:32:38] !log Evening SWAT done
[23:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:10] RECOVERY - PHP7 rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 75779 bytes in 0.564 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[23:35:22] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[23:35:38] that was me.. had to merge
[23:35:42] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[23:35:58] mutante: did puppet-merge not automatically perform the labs merge first?
[23:36:32] cdanis: i just hit submit on gerrit and then got distracted ..is all
[23:36:42] ahh okay :)
[23:38:20] might be nice to have the bot log "user X ran puppet-merge" the way it does when we use conftool.. shrug
[23:40:58] (PS2) Dzahn: scandium: move Hiera key to disable systemd monitor to role level [puppet] - https://gerrit.wikimedia.org/r/527253
[23:41:28] (CR) Dzahn: [C: +2] "noop https://puppet-compiler.wmflabs.org/compiler1002/17715/" [puppet] - https://gerrit.wikimedia.org/r/527253 (owner: Dzahn)
[23:46:06] (CR) CDanis: [C: +2] dbctl: require commit messages [software/conftool] - https://gerrit.wikimedia.org/r/526774 (owner: CDanis)
[23:48:45] (Merged) jenkins-bot: dbctl: require commit messages [software/conftool] - https://gerrit.wikimedia.org/r/526774 (owner: CDanis)