[00:00:04] <jouncebot>	 twentyafterfour: #bothumor I ♥ Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T0000).
[00:02:08] <wikibugs>	 (PS8) Jeena Huneidi: Add Parsoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909)
[00:03:11] <wikibugs>	 (CR) Jeena Huneidi: Add Parsoid chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909) (owner: Jeena Huneidi)
[00:04:05] <wikibugs>	 (PS9) Jeena Huneidi: Add Parsoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909)
[00:22:11] * Krinkle staging on mwdebug1002
[00:28:22] <logmsgbot>	 !log krinkle@deploy1001 sync-file aborted: composer.json composer.lock dblists debug.json docroot errorpages fc-list fonts images langlist langlist-labs multiversion php php-1.34.0-wmf.13 php-1.34.0-wmf.14 php-1.34.0-wmf.15 php-1.34.0-wmf.16 phpcs.xml phpunit.xml portals private README requirements.txt robots.txt rpc scap setup.py src static test-requirements.txt tests tox.ini typos vendor w wikiversions.json wikiversions-labs.js
[00:28:22] <logmsgbot>	 fig List of module names that contain QUnit test suites (duration: 00m 01s)
[00:28:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:32] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.16/includes/resourceloader/ResourceLoader.php: acfff6751f3b8f7650 (duration: 00m 55s)
[00:30:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:01] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.16/includes/specials/SpecialJavaScriptTest.php: acfff6751f3b8f7650 (duration: 00m 54s)
[00:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:08] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.16/resources/Resources.php: acfff6751f3b8f7650 (duration: 00m 54s)
[00:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:17] <brennen>	 train is presently still blocked on T229482.  i'm signing off and will resume efforts to move forward at 08:00 MDT / 14:00 UTC.
[00:36:18] <stashbot>	 T229482: PHP Warning: Wikibase\Lib\Store\Sql\WikiPageEntityRevisionLookup::getEntityRevision: Entity not loaded - https://phabricator.wikimedia.org/T229482
[00:50:51] <icinga-wm>	 PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:10:15] <wikibugs>	 (PS1) Ayounsi: Prometheus, collect Netbox metrics [puppet] - https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331)
[01:18:45] <icinga-wm>	 RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:26:48] <wikibugs>	 (PS2) Ayounsi: Prometheus, collect Netbox metrics [puppet] - https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331)
[01:33:03] <wikibugs>	 (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/17700/" [puppet] - https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: Ayounsi)
[01:40:40] <wikibugs>	 (CR) Ayounsi: "This doesn't work." [puppet] - https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: Ayounsi)
[01:57:27] <icinga-wm>	 PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:12:21] <icinga-wm>	 PROBLEM - puppet last run on wtp1032 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:25:27] <icinga-wm>	 RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:40:19] <icinga-wm>	 RECOVERY - puppet last run on wtp1032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:59:03] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 50.21, 24.55, 13.98 https://wikitech.wikimedia.org/wiki/Application_servers
[02:59:13] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 53.53, 25.34, 13.86 https://wikitech.wikimedia.org/wiki/Application_servers
[02:59:25] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 56.21, 27.80, 15.18 https://wikitech.wikimedia.org/wiki/Application_servers
[03:00:11] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:00:41] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 13.21, 19.68, 13.37 https://wikitech.wikimedia.org/wiki/Application_servers
[03:00:49] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20580336 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:00:53] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 18.52, 22.39, 14.07 https://wikitech.wikimedia.org/wiki/Application_servers
[03:00:53] <icinga-wm>	 PROBLEM - HHVM rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:01:03] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 19.93, 23.87, 15.06 https://wikitech.wikimedia.org/wiki/Application_servers
[03:01:27] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out 
[03:01:27] <icinga-wm>	  was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{doma
[03:01:27] <icinga-wm>	 m/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:01:33] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[03:01:33] <icinga-wm>	 received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:01:35] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[03:01:35] <icinga-wm>	 received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:01:35] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[03:01:35] <icinga-wm>	 received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/
[03:01:35] <icinga-wm>	 itoring/recommendation_api
[03:01:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:01:37] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response wa
[03:01:37] <icinga-wm>	 ain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:01:37] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was receiv
[03:01:38] <icinga-wm>	 article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:01:38] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received h
[03:01:39] <icinga-wm>	 ikimedia.org/wiki/Services/Monitoring/mobileapps
[03:01:45] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[03:01:45] <icinga-wm>	 eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:01:45] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed o
[03:01:46] <icinga-wm>	 nse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:01:46] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[03:01:46] <icinga-wm>	 received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/
[03:01:46] <icinga-wm>	 itoring/recommendation_api
[03:01:57] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:01:57] <icinga-wm>	 PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:01:57] <icinga-wm>	 PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:01:59] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:01:59] <icinga-wm>	 PROBLEM - HHVM rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:02:05] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:02:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:02:09] <icinga-wm>	 PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:02:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:02:13] <icinga-wm>	 PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:02:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:02:15] <icinga-wm>	 PROBLEM - HHVM rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:02:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:02:19] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 73.36, 40.89, 22.90 https://wikitech.wikimedia.org/wiki/Application_servers
[03:02:29] <icinga-wm>	 PROBLEM - Apache HTTP on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[03:02:39] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:02:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:02:47] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:03:01] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:03:09] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:03:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:03:11] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:03:11] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:03:13] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:03:13] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:03:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:03:19] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 660 bytes in 0.234 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:03:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:03:25] <icinga-wm>	 RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 658 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:25] <icinga-wm>	 RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 659 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:27] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 1.574 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:29] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 659 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:29] <icinga-wm>	 RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 76068 bytes in 1.393 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:31] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1316 is CRITICAL: CRITICAL - load average: 78.77, 47.30, 28.27 https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:33] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 76114 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:03:37] <icinga-wm>	 RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:37] <icinga-wm>	 RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 76067 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:41] <icinga-wm>	 RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:03:43] <icinga-wm>	 RECOVERY - HHVM rendering on mw1346 is OK: HTTP OK: HTTP/1.1 200 OK - 76067 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:45] <icinga-wm>	 RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:03:57] <icinga-wm>	 RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:03:59] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 56.22, 30.20, 18.09 https://wikitech.wikimedia.org/wiki/Application_servers
[03:04:01] <icinga-wm>	 RECOVERY - HHVM rendering on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 76114 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:04:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19283016 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:04:09] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 69.40, 35.61, 19.74 https://wikitech.wikimedia.org/wiki/Application_servers
[03:04:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:04:13] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:04:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:04:33] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:04:43] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:04:45] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 55.65, 33.97, 20.18 https://wikitech.wikimedia.org/wiki/Application_servers
[03:04:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:04:55] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:05:05] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 64.86, 36.46, 20.50 https://wikitech.wikimedia.org/wiki/Application_servers
[03:05:07] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 92137704 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:05:21] <icinga-wm>	 PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:05:33] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 13.33, 26.71, 20.76 https://wikitech.wikimedia.org/wiki/Application_servers
[03:05:51] <icinga-wm>	 PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:06:01] <icinga-wm>	 PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:06:45] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1316 is OK: OK - load average: 16.22, 32.03, 25.94 https://wikitech.wikimedia.org/wiki/Application_servers
[03:07:11] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 12.92, 24.69, 18.59 https://wikitech.wikimedia.org/wiki/Application_servers
[03:07:23] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 10.48, 23.75, 18.18 https://wikitech.wikimedia.org/wiki/Application_servers
[03:07:37] <icinga-wm>	 PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:07:53] <icinga-wm>	 PROBLEM - puppet last run on an-worker1078 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:07:59] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 11.42, 22.91, 18.46 https://wikitech.wikimedia.org/wiki/Application_servers
[03:08:31] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 80427384 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:09:53] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 8.04, 20.70, 18.30 https://wikitech.wikimedia.org/wiki/Application_servers
[03:09:55] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 124600 and 37 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:10:31] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 46624 and 73 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:13:25] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3104 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:14:31] <wikibugs>	 (PS2) Ejegg: CSP for banner preview: allow remind me later host [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019)
[03:30:19] <icinga-wm>	 RECOVERY - puppet last run on an-worker1078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:32:33] <icinga-wm>	 RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:33:19] <icinga-wm>	 RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:33:53] <icinga-wm>	 RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:34:05] <icinga-wm>	 RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:35:35] <icinga-wm>	 RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:08:23] <icinga-wm>	 PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:18:51] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:02] <wikibugs>	 Operations, netops, observability: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824 (ayounsi) Good news, this is already implemented with: https://github.com/librenms/librenms/pull/9879  Bad news, for unknown reasons so far, the switches don't expose the proper interface data. For...
[04:36:29] <icinga-wm>	 RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:44:46] <wikibugs>	 Operations, ops-codfw, DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui) Thanks @Papaul I have started MySQL again, let's monitor the host for a few days
[04:46:31] <wikibugs>	 (CR) Marostegui: [C: +2] DNS: Remove DNS entires for db2042 [dns] - https://gerrit.wikimedia.org/r/526762 (owner: Papaul)
[04:48:29] <wikibugs>	 (CR) Marostegui: "Ooooh sweet! Thanks! :)" [software/conftool] - https://gerrit.wikimedia.org/r/526774 (owner: CDanis)
[04:51:35] <wikibugs>	 (PS1) Marostegui: db2058: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/526836 (https://phabricator.wikimedia.org/T229449)
[04:52:11] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T229449 (Marostegui) As expected, controller failure: ` /system1/log1/record14   Targets   Properties     number=14     severity=Critical     date=07/31/2019     time=16:51     description=Drive Array...
[04:52:54] <wikibugs>	 (CR) Marostegui: [C: +2] db2058: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/526836 (https://phabricator.wikimedia.org/T229449) (owner: Marostegui)
[04:53:05] <icinga-wm>	 PROBLEM - puppet last run on dns5002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:54:34] <wikibugs>	 Operations, DBA: db2058: Broken storage - https://phabricator.wikimedia.org/T229449 (Marostegui)
[04:55:11] <wikibugs>	 Operations, DBA: db2058: Broken storage - https://phabricator.wikimedia.org/T229449 (Marostegui) Open→Declined I am going to close this as this host will be decommissioned {T228258}
[04:59:19] <wikibugs>	 Operations, DBA, decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (Marostegui)
[04:59:49] <wikibugs>	 Operations, DBA, decommission: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (Marostegui) p:Triage→Normal
[04:59:58] <wikibugs>	 Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[05:04:25] <wikibugs>	 (PS1) Marostegui: filtered_tables: Remove abuse_filter_log.afl_log_id [puppet] - https://gerrit.wikimedia.org/r/526837 (https://phabricator.wikimedia.org/T226851)
[05:14:33] <wikibugs>	 (CR) Marostegui: [C: +2] filtered_tables: Remove abuse_filter_log.afl_log_id [puppet] - https://gerrit.wikimedia.org/r/526837 (https://phabricator.wikimedia.org/T226851) (owner: Marostegui)
[05:20:21] <icinga-wm>	 RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[05:21:07] <icinga-wm>	 RECOVERY - puppet last run on dns5002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[05:22:59] <wikibugs>	 (PS1) Marostegui: db212[5-6]: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/526838 (https://phabricator.wikimedia.org/T228969)
[05:24:50] <wikibugs>	 (CR) Marostegui: [C: +2] db212[5-6]: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/526838 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:47:59] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1112" [mediawiki-config] - https://gerrit.wikimedia.org/r/526839
[05:52:55] <wikibugs>	 (CR) Giuseppe Lavagetto: "The code seems correct to me but I'm not sure we want every commit to have a message, esp if we're in the middle of a series of commits an" [software/conftool] - https://gerrit.wikimedia.org/r/526774 (owner: CDanis)
[06:11:54] <wikibugs>	 (PS1) Vgutierrez: acme_chief: Save the certificate using the 3 save modes on issuance time [software/acme-chief] - https://gerrit.wikimedia.org/r/526840 (https://phabricator.wikimedia.org/T229096)
[06:17:37] <wikibugs>	 (PS7) Giuseppe Lavagetto: mediawiki: allow installing php7 only [puppet] - https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976)
[06:18:38] <_joe_>	 !log depooling mw1348 while moving it to no hhvm support.
[06:18:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:56] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1348.eqiad.wmnet
[06:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:12] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: allow installing php7 only [puppet] - https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: Giuseppe Lavagetto)
[06:21:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:22:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:28:34] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw1348.eqiad.wmnet
[06:28:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:13] <icinga-wm>	 PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/tmpreaper.conf] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:32:11] <icinga-wm>	 PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:32:27] <icinga-wm>	 PROBLEM - puppet last run on db2115 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:35:31] <icinga-wm>	 PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/tmpreaper.conf] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:36:10] <wikibugs>	 (PS1) Elukey: Remove Spark2 sasl config from the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/526843 (https://phabricator.wikimedia.org/T226698)
[06:36:49] <wikibugs>	 (CR) Elukey: [C: +2] Remove Spark2 sasl config from the Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/526843 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[06:40:37] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:41:40] <wikibugs>	 (PS2) Giuseppe Lavagetto: mediawiki: make mw1270 a php7-only application server [puppet] - https://gerrit.wikimedia.org/r/526720
[06:42:03] <_joe_>	 !log depooling mw1270 while migrating it to pure-php7 
[06:42:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:21] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1270.eqiad.wmnet
[06:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:34] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: make mw1270 a php7-only application server [puppet] - https://gerrit.wikimedia.org/r/526720 (owner: Giuseppe Lavagetto)
[06:48:20] <wikibugs>	 (PS1) Elukey: profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - https://gerrit.wikimedia.org/r/526849
[06:49:02] <wikibugs>	 Operations, serviceops, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (Joe)
[06:49:31] <wikibugs>	 Operations, serviceops, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (Joe)
[06:50:29] <wikibugs>	 (PS2) Elukey: profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - https://gerrit.wikimedia.org/r/526849
[06:50:45] <wikibugs>	 Operations, serviceops, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (Joe)
[06:51:44] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw1270.eqiad.wmnet
[06:51:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:20] <wikibugs>	 (PS3) Elukey: profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - https://gerrit.wikimedia.org/r/526849
[06:58:30] <icinga-wm>	 RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:58:44] <icinga-wm>	 RECOVERY - puppet last run on db2115 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:59:04] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:59:44] <elukey>	 !log install python3-docopt manually on lithium to test check_anycast_healthchecker
[06:59:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:40] <icinga-wm>	 RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:04:28] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:07:00] <wikibugs>	 Operations, netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (elukey) The link went down again:  ` elukey@re0.cr2-eqiad> show interfaces descriptions | match down xe-4/1/3        up    down Transport: cr2-esams:xe-0/1/3 (Level3, BDFS2448,...
[07:07:31] <volans>	 elukey: the router interface is the same you reported earlier?
[07:07:33] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1112" [mediawiki-config] - https://gerrit.wikimedia.org/r/526839 (owner: Marostegui)
[07:07:50] <volans>	 never mind, I should read more the timestamp around messages :D
[07:07:53] * volans still waking up
[07:08:25] <elukey>	 volans: yep yep commented in security
[07:08:40] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1112" [mediawiki-config] - https://gerrit.wikimedia.org/r/526839 (owner: Marostegui)
[07:08:42] <elukey>	 from the task it seems all good, Arzel drained the link
[07:08:56] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1112" [mediawiki-config] - https://gerrit.wikimedia.org/r/526839 (owner: Marostegui)
[07:09:20] <icinga-wm>	 RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:09:50] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1112 (duration: 00m 54s)
[07:09:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8843', previous config saved to /var/cache/conftool/dbconfig/20190801-071022-marostegui.json
[07:10:24] <wikibugs>	 (PS4) Elukey: profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - https://gerrit.wikimedia.org/r/526849
[07:10:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:03] <wikibugs>	 (CR) Marostegui: "> The code seems correct to me but I'm not sure we want every commit" [software/conftool] - https://gerrit.wikimedia.org/r/526774 (owner: CDanis)
[07:13:15] <wikibugs>	 (PS5) Elukey: profile:bird::anycast_healthchecker_monitoring: add python3-docopt [puppet] - https://gerrit.wikimedia.org/r/526849
[07:16:34] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:48] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:17:52] <wikibugs>	 (PS1) Marostegui: mariadb: Provision db2129 into s6 [puppet] - https://gerrit.wikimedia.org/r/526935 (https://phabricator.wikimedia.org/T228969)
[07:18:58] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Provision db2129 into s6 [puppet] - https://gerrit.wikimedia.org/r/526935 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:25:14] <wikibugs>	 Operations, Traffic, netops, Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (elukey) Resolved→Open Couple of notes about the anycast-healthchecker:  1) the `anycast-healthchecker` is not in jessie-wikimedia, so puppet on lithium/wezen is currently broken: ` r...
[07:25:30] <wikibugs>	 Operations, Traffic, netops, Patch-For-Review, Performance-Team (Radar): Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (elukey)
[07:25:32] <icinga-wm>	 ACKNOWLEDGEMENT - Check if anycast-healthchecker and all configured threads are running on lithium is CRITICAL: NRPE: Command check_anycast_healthchecker not defined Elukey T186550 https://wikitech.wikimedia.org/wiki/Anycast_recursive_DNS%23Anycast_healthchecker_not_running
[07:27:35] <wikibugs>	 (PS1) Marostegui: db-eqiad,db-codfw.php: Provision db2126 into s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/526944 (https://phabricator.wikimedia.org/T228969)
[07:27:43] <_joe_>	 !log removing mw1348 from rotation - reimaging for T228976
[07:27:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:52] <stashbot>	 T228976: Allow to avoid installing HHVM from the mediawiki puppet module and profile - https://phabricator.wikimedia.org/T228976
[07:29:12] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,name=mw1348.eqiad.wmnet
[07:29:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:54] <wikibugs>	 Operations, serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (elukey) Added patch to the Debian bug in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763858#10
[07:31:52] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/526944 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:34:40] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Provision db2126 into s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/526944 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:35:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8844', previous config saved to /var/cache/conftool/dbconfig/20190801-073459-marostegui.json
[07:35:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:06] <wikibugs>	 (Merged) jenkins-bot: db-eqiad,db-codfw.php: Provision db2126 into s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/526944 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[07:37:19] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/acme-chief] - https://gerrit.wikimedia.org/r/526840 (https://phabricator.wikimedia.org/T229096) (owner: Vgutierrez)
[07:37:21] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Provision db2126 into s2 T228969 (duration: 00m 54s)
[07:37:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:40] <stashbot>	 T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[07:38:26] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Provision db2126 into s2 T228969 (duration: 00m 55s)
[07:38:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:23] <wikibugs>	 (CR) jenkins-bot: db-eqiad,db-codfw.php: Provision db2126 into s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/526944 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[08:20:44] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Add Parsoid chart [deployment-charts] - https://gerrit.wikimedia.org/r/525481 (https://phabricator.wikimedia.org/T228909) (owner: Jeena Huneidi)
[08:20:51] <wikibugs>	 Operations, Security-Team, Traffic: Consider removing X-Wikimedia-Security-Audit VCL support - https://phabricator.wikimedia.org/T229320 (ema) @dduvall: any reason not to proceed with the removal?
[08:31:19] <wikibugs>	 Operations, Traffic, Patch-For-Review, discovery-system, services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (Volans) @Joe any feedback on the above proposal? I'd really like to split the users ASAP given that dbctl is being deployed.
[08:39:52] <wikibugs>	 (CR) Volans: "I'm ok with the UI of requiring the message and !logging for each write action to the mwconfig object read by MW." (3 comments) [software/conftool] - https://gerrit.wikimedia.org/r/526774 (owner: CDanis)
[08:48:08] <wikibugs>	 (PS1) Filippo Giunchedi: site: lithium to spare [puppet] - https://gerrit.wikimedia.org/r/526980 (https://phabricator.wikimedia.org/T229557)
[08:48:33] <wikibugs>	 Operations, Wikimedia-Logstash, User-fgiunchedi, User-herron: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (fgiunchedi) Open→Resolved a:fgiunchedi This is done, decom for lithium is at {T229557}
[08:53:51] <wikibugs>	 (PS4) Filippo Giunchedi: toil: add rsyslog TLS remedy [puppet] - https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406)
[09:00:38] <wikibugs>	 Operations, ops-eqiad, DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (aborrero) The dates you mention the WMCS team will be barely available because travel/wikimania/offsites, etc. Since the racks are "easy" for us, this shouldn't be a bloc...
[09:02:44] <wikibugs>	 (CR) Marostegui: [C: +1] "Let's deploy carefully?" [mediawiki-config] - https://gerrit.wikimedia.org/r/525147 (owner: Aaron Schulz)
[09:02:46] <wikibugs>	 (PS5) Filippo Giunchedi: toil: add rsyslog TLS remedy [puppet] - https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406)
[09:02:52] <wikibugs>	 (CR) Volans: "Couple of comments/questions inline, looks good otherwise" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/521313 (owner: CRusnov)
[09:08:42] <wikibugs>	 Operations, Analytics, Analytics-Kanban, Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (elukey) Resolved→Open p:High→Normal
[09:08:47] <wikibugs>	 Operations, Analytics, Security, Services (watching), Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (elukey)
[09:09:04] <wikibugs>	 Operations, Analytics, Analytics-Kanban, Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (elukey) We didn't discuss if SERVICE UNKNOWN needs to alarm or not for some services :)
[09:13:06] <wikibugs>	 (PS1) Urbanecm: flaggedrevs.php: Remove useless wgAddGroups/wgRemoveGroups declarations [mediawiki-config] - https://gerrit.wikimedia.org/r/527003
[09:15:26] <wikibugs>	 (CR) Filippo Giunchedi: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/526819 (https://phabricator.wikimedia.org/T226331) (owner: Ayounsi)
[09:16:11] <wikibugs>	 Operations, Traffic: fifo-log-tailer: evergrowing memory usage - https://phabricator.wikimedia.org/T229414 (ema) I've been digging a bit further and reproduced this on my workstation with the following program:  `lang=go // growmem.go package main  import (         "io/ioutil"         "os" )  func main()...
[09:17:37] <wikibugs>	 (PS6) Filippo Giunchedi: toil: add rsyslog TLS remedy [puppet] - https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406)
[09:19:50] <wikibugs>	 (CR) Alexandros Kosiaris: restrouter: Add helmfile stanzas (7 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/526719 (https://phabricator.wikimedia.org/T223953) (owner: Alexandros Kosiaris)
[09:19:54] <wikibugs>	 (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/17705/" [puppet] - https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) (owner: Filippo Giunchedi)
[09:20:06] <wikibugs>	 (PS2) Alexandros Kosiaris: restrouter: Add helmfile stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/526719 (https://phabricator.wikimedia.org/T223953)
[09:21:13] <wikibugs>	 Operations, DBA: db2058: Broken storage - https://phabricator.wikimedia.org/T229449 (Marostegui) I rebooted the server and this is the boot message: ` Slot 0  HP Smart Array P420i Controller        (1 GB, v6.00)  1 Logical Drive 1719-Slot 0 Drive Array - A controller failure event occurred prior to this...
[09:21:52] <icinga-wm>	 RECOVERY - MariaDB disk space on db2058 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[09:22:40] <icinga-wm>	 RECOVERY - Disk space on db2058 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2058&var-datasource=codfw+prometheus/ops
[09:24:55] <wikibugs>	 (PS1) Vgutierrez: fifo-log-demux: Keep attempting to read the FIFO after EOF [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527013
[09:25:06] <icinga-wm>	 RECOVERY - HP RAID on db2058 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:29:52] <Urbanecm>	 jouncebot, next
[09:29:52] <jouncebot>	 In 1 hour(s) and 30 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1100)
[09:34:46] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "LGTM overall, see comment re: metric names" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/526782 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[09:35:34] <icinga-wm>	 RECOVERY - Check systemd state on db2058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:39:20] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s6 on db2058 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[09:41:16] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s6 on db2058 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[09:50:21] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/526791 (owner: Ayounsi)
[09:51:09] <wikibugs>	 (PS2) Jbond: urbanecm's dotfiles: gitconfig: Add push-for-review, use SSH for pushing [puppet] - https://gerrit.wikimedia.org/r/526796 (owner: Urbanecm)
[09:52:10] <wikibugs>	 (CR) Jbond: [C: +2] urbanecm's dotfiles: gitconfig: Add push-for-review, use SSH for pushing [puppet] - https://gerrit.wikimedia.org/r/526796 (owner: Urbanecm)
[09:59:30] <wikibugs>	 (PS3) Alexandros Kosiaris: restrouter: Add helmfile stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/526719 (https://phabricator.wikimedia.org/T223953)
[09:59:57] <wikibugs>	 (PS4) Alexandros Kosiaris: restrouter: Add kubernetes stanzas [puppet] - https://gerrit.wikimedia.org/r/526632 (https://phabricator.wikimedia.org/T223953)
[10:04:11] <wikibugs>	 (CR) Jbond: "looks good, one nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/526849 (owner: Elukey)
[10:08:44] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "Thanks for working on it! See inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/526611 (https://phabricator.wikimedia.org/T229357) (owner: Elukey)
[10:10:00] <_joe_>	 !log repooling mw1348 after reimaging as pure-php7 
[10:10:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:23] <jbond42>	 !log rolling upgrade for patch
[10:12:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:37] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): vcl: add Access-Control-Allow-Origin to mobile redirects [puppet] - https://gerrit.wikimedia.org/r/526627 (https://phabricator.wikimedia.org/T229385)
[10:12:39] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): vcl: add Access-Control-Allow-Origin to mobile redirects (1 comment) [puppet] - https://gerrit.wikimedia.org/r/526627 (https://phabricator.wikimedia.org/T229385) (owner: Lucas Werkmeister (WMDE))
[10:18:04] <wikibugs>	 Operations, Traffic, netops, Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (fgiunchedi) Thanks @elukey ! Indeed anycast-healthchecker isn't in jessie-wikimedia, lithium is being decom'd and if wezen gets reinstalled it'll be buster, and I installed anycast-healthche...
[10:19:27] <wikibugs>	 (PS2) Vgutierrez: fifo-log-demux: Keep attempting to read the FIFO after EOF [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527013
[10:19:29] <wikibugs>	 (PS1) Vgutierrez: fifo-log-demux: Deprecate socket activation [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527039
[10:22:41] <wikibugs>	 (PS1) Jbond: mysql: remove grants for sarin and neodymium [puppet] - https://gerrit.wikimedia.org/r/527043
[10:24:19] <wikibugs>	 (PS2) Jbond: mysql: remove grants for sarin and neodymium [puppet] - https://gerrit.wikimedia.org/r/527043 (https://phabricator.wikimedia.org/T220503)
[10:24:34] <wikibugs>	 (PS3) Jbond: mysql: remove grants for sarin and neodymium [puppet] - https://gerrit.wikimedia.org/r/527043 (https://phabricator.wikimedia.org/T220503)
[10:29:33] <wikibugs>	 (PS2) Vgutierrez: fifo-log-demux: Remove socket activation [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527039
[10:29:35] <wikibugs>	 (PS3) Vgutierrez: fifo-log-demux: Keep attempting to read the FIFO after EOF [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527013
[10:30:35] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Add the LBRemoteCluster class. (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto)
[10:30:52] <wikibugs>	 (CR) Ema: [C: +1] "Looks good, as a reminder we should get rid of socket activation support from puppet too." [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/527039 (owner: Vgutierrez)
[10:32:34] <wikibugs>	 (PS15) Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947
[10:35:03] <wikibugs>	 (PS1) Ladsgroup: Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"" [mediawiki-config] - https://gerrit.wikimedia.org/r/527052
[10:35:35] <wikibugs>	 (CR) Volans: [C: +1] "LGTM!" [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto)
[10:36:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto)
[10:39:02] <wikibugs>	 (PS1) Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - https://gerrit.wikimedia.org/r/527056 (https://phabricator.wikimedia.org/T228657)
[10:40:41] <wikibugs>	 (CR) Jbond: [C: +2] puppetmaster: add canary hosts to puppetmaster [puppet] - https://gerrit.wikimedia.org/r/527056 (https://phabricator.wikimedia.org/T228657) (owner: Jbond)
[10:40:58] <wikibugs>	 (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2058 [mediawiki-config] - https://gerrit.wikimedia.org/r/527058 (https://phabricator.wikimedia.org/T229543)
[10:46:19] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/527058 (https://phabricator.wikimedia.org/T229543) (owner: Marostegui)
[10:47:14] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2058 [mediawiki-config] - https://gerrit.wikimedia.org/r/527058 (https://phabricator.wikimedia.org/T229543) (owner: Marostegui)
[10:47:45] <wikibugs>	 Operations, DBA, decommission, Patch-For-Review: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543 (Marostegui)
[10:48:15] <wikibugs>	 (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2058 [mediawiki-config] - https://gerrit.wikimedia.org/r/527058 (https://phabricator.wikimedia.org/T229543) (owner: Marostegui)
[10:48:30] <wikibugs>	 (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2058 [mediawiki-config] - https://gerrit.wikimedia.org/r/527058 (https://phabricator.wikimedia.org/T229543) (owner: Marostegui)
[10:50:13] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2058 from config T229543 (duration: 00m 55s)
[10:50:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:21] <stashbot>	 T229543: Decommission db2058.codfw.wmnet - https://phabricator.wikimedia.org/T229543
[10:50:25] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527064 (https://phabricator.wikimedia.org/T228657)
[10:51:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527064 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond)
[10:51:15] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2058 from config T229543 (duration: 00m 57s)
[10:51:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:11] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on bast4002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. John Bond Testing against new puppet master https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:58:13] <wikibugs>	 (03PS16) 10Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1100).
[11:00:04] <jouncebot>	 kart_, Urbanecm, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:21] <Amir1>	 o/
[11:00:22] <Urbanecm>	 I can SWAT today!
[11:00:25] <kart_>	 Urbanecm: I've wmf.16 patch, you can go ahead with your config patches first.
[11:00:32] <Urbanecm>	 ack
[11:00:38] <kart_>	 Urbanecm: If you can do my patch, that's great. Already +2ed though.
[11:00:51] <Urbanecm>	 kart_, if you want me to, happy to SWAT yours too!
[11:01:02] <kart_>	 Urbanecm: Please do :)
[11:01:05] <Urbanecm>	 ok
[11:01:12] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526492 (https://phabricator.wikimedia.org/T229346) (owner: 10Urbanecm)
[11:02:13] <wikibugs>	 (03Merged) 10jenkins-bot: flaggedrevs.php: Allow wikis to remove ability to promote to/demote from autoreview/editor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526492 (https://phabricator.wikimedia.org/T229346) (owner: 10Urbanecm)
[11:02:30] <wikibugs>	 (03CR) 10jenkins-bot: flaggedrevs.php: Allow wikis to remove ability to promote to/demote from autoreview/editor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526492 (https://phabricator.wikimedia.org/T229346) (owner: 10Urbanecm)
[11:02:32] <wikibugs>	 (03PS2) 10Urbanecm: flaggedrevs.php: Remove useless wgAddGroups/wgRemoveGroups declarations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527003
[11:02:36] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527003 (owner: 10Urbanecm)
[11:02:53] <Urbanecm>	 pulled the merged patch onto mwdebug1002
[11:03:34] <Urbanecm>	 syncing
[11:05:08] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: SWAT: aa82657: flaggedrevs.php: Allow wikis to remove ability to promote to/demote from autoreview/editor (T229346) (duration: 00m 54s)
[11:05:14] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Specify candidate masters [puppet] - 10https://gerrit.wikimedia.org/r/527073
[11:05:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:18] <stashbot>	 T229346: Administrators of the Hungarian Wikipedia have unapproved right - https://phabricator.wikimedia.org/T229346
[11:06:00] <wikibugs>	 (03Merged) 10jenkins-bot: flaggedrevs.php: Remove useless wgAddGroups/wgRemoveGroups declarations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527003 (owner: 10Urbanecm)
[11:07:52] <wikibugs>	 (03PS1) 10Vgutierrez: fifo_log_demux: Remove socket activation [puppet] - 10https://gerrit.wikimedia.org/r/527075
[11:07:54] <wikibugs>	 (03CR) 10jenkins-bot: flaggedrevs.php: Remove useless wgAddGroups/wgRemoveGroups declarations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527003 (owner: 10Urbanecm)
[11:09:37] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: SWAT: 7db98f3: flaggedrevs.php: Remove useless wgAddGroups/wgRemoveGroups declarations (duration: 00m 55s)
[11:09:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:29] <wikibugs>	 (03CR) 10Vgutierrez: "pcc seems happy almost showing a NOOP: https://puppet-compiler.wmflabs.org/compiler1002/17707/" [puppet] - 10https://gerrit.wikimedia.org/r/527075 (owner: 10Vgutierrez)
[11:11:46] <wikibugs>	 (03CR) 10Marostegui: "noop as expected: https://puppet-compiler.wmflabs.org/compiler1001/17708/" [puppet] - 10https://gerrit.wikimedia.org/r/527073 (owner: 10Marostegui)
[11:11:48] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Specify candidate masters [puppet] - 10https://gerrit.wikimedia.org/r/527073 (owner: 10Marostegui)
[11:11:59] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto)
[11:12:04] <Urbanecm>	 kart_, Amir1: Done with my patches, waiting on CI for kart_'s backport. Amir1, do you want me to deploy your patch, or do you prefer deploying it yourself?
[11:12:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto)
[11:14:05] <Urbanecm>	 kart_, your backport is merged
[11:14:06] <Urbanecm>	 processing it
[11:14:51] <Urbanecm>	 kart_, your patch is on mwdebug1002
[11:15:11] <kart_>	 cool.
[11:15:22] <kart_>	 Nothing to check. Go ahead :)
[11:15:48] <Urbanecm>	 ok kart_ 
[11:16:16] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: Fix regression [puppet] - 10https://gerrit.wikimedia.org/r/527078 (https://phabricator.wikimedia.org/T228657)
[11:16:51] <wikibugs>	 (03PS2) 10Jbond: puppetmaster: Fix regression [puppet] - 10https://gerrit.wikimedia.org/r/527078 (https://phabricator.wikimedia.org/T228657)
[11:16:54] <Amir1>	 Urbanecm: it would be great if you deploy it
[11:17:49] <Urbanecm>	 Amir1, will do
[11:17:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: Fix regression [puppet] - 10https://gerrit.wikimedia.org/r/527078 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond)
[11:18:11] <wikibugs>	 (03Merged) 10jenkins-bot: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto)
[11:19:06] <wikibugs>	 (03PS2) 10Urbanecm: Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527052 (owner: 10Ladsgroup)
[11:19:08] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/ExternalGuidance/: SWAT: 9402c36: Provide the messages in the target language of translation (T228019) (duration: 00m 56s)
[11:19:12] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527052 (owner: 10Ladsgroup)
[11:19:14] <wikibugs>	 (03CR) 10jenkins-bot: Add the LBRemoteCluster class. [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto)
[11:19:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:16] <stashbot>	 T228019: Injected info does not get translated - https://phabricator.wikimedia.org/T228019
[11:19:36] <Amir1>	 Thanks!
[11:19:43] <Amir1>	 Urbanecm: It's not testable
[11:19:48] <Urbanecm>	 Amir1, ack
[11:20:20] <kart_>	 Thanks Urbanecm 
[11:20:22] <Urbanecm>	 yw kart_ 
[11:20:58] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review, 10discovery-system, 10services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10Joe) >>! In T97972#5353056, @Volans wrote: >>>! In T97972#5352851, @Joe wrote: >> IIRC we already have an account specialized for accessi...
[11:21:56] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527052 (owner: 10Ladsgroup)
[11:22:11] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527052 (owner: 10Ladsgroup)
[11:22:57] <Urbanecm>	 Amir1, syncing
[11:23:06] <Amir1>	 marostegui: ^
[11:23:07] <Amir1>	 Thanks
[11:23:10] <marostegui>	 yep, I am ready
[11:23:49] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: c164132: Revert "Revert "Switch property terms migration to WRITE_NEW on production wikidata"" (T225053) (duration: 00m 55s)
[11:23:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:57] <stashbot>	 T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225053
[11:24:12] <Urbanecm>	 Amir1, marostegui: Patch was synced
[11:26:42] <marostegui>	 Amir1: I am starting to see the read_key handler spiking, let's see what it does
[11:27:04] <wikibugs>	 (03CR) 10Mobrovac: "We are in the process of splitting RESTBase into two services in production, so I'd advocate for pushing this a bit down the line and add " [puppet] - 10https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[11:31:30] <wikibugs>	 (03PS1) 10Urbanecm: Add nlm.nih.gov to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527082 (https://phabricator.wikimedia.org/T229470)
[11:32:01] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527082 (https://phabricator.wikimedia.org/T229470) (owner: 10Urbanecm)
[11:33:09] <wikibugs>	 (03Merged) 10jenkins-bot: Add nlm.nih.gov to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527082 (https://phabricator.wikimedia.org/T229470) (owner: 10Urbanecm)
[11:33:25] <wikibugs>	 (03CR) 10jenkins-bot: Add nlm.nih.gov to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527082 (https://phabricator.wikimedia.org/T229470) (owner: 10Urbanecm)
[11:34:57] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 1e4458e: Add nlm.nih.gov to the wgCopyUploadsDomains whitelist for commonswiki (T229470) (duration: 00m 53s)
[11:35:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:05] <stashbot>	 T229470: Add nlm.nih.gov to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T229470
[11:36:59] <wikibugs>	 (03PS1) 10Urbanecm: Add files.geocollections.info to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527084 (https://phabricator.wikimedia.org/T229547)
[11:37:43] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527084 (https://phabricator.wikimedia.org/T229547) (owner: 10Urbanecm)
[11:38:31] <marostegui>	 Amir1: It keeps increasing, but it is not a bad thing, that handler means the reads are being done from an index
[11:38:41] <wikibugs>	 (03Merged) 10jenkins-bot: Add files.geocollections.info to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527084 (https://phabricator.wikimedia.org/T229547) (owner: 10Urbanecm)
[11:38:58] <wikibugs>	 (03CR) 10jenkins-bot: Add files.geocollections.info to the wgCopyUploadsDomains whitelist for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527084 (https://phabricator.wikimedia.org/T229547) (owner: 10Urbanecm)
[11:39:25] <Amir1>	 marostegui: yeah. Let's see when it stops. Also, there are plans to improve it but it might take a week or two to get it there
[11:39:48] <marostegui>	 Amir1: the query latency remains the same, so there is not a degradation there
[11:39:51] <marostegui>	 the traffic has increased
[11:40:00] <marostegui>	 but not the amount of queries or the processes
[11:40:11] <marostegui>	 so we are just reading more, but so far, fast enough
[11:40:49] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: c51baa3: Add files.geocollections.info to the wgCopyUploadsDomains whitelist for commonswiki (T229547) (duration: 00m 55s)
[11:40:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:57] <stashbot>	 T229547: Add files.geocollections.info to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T229547
[11:41:08] <icinga-wm>	 PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27460 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1017&var-datasource=eqiad+prometheus/ops
[11:41:38] <Amir1>	 the traffic should not increase that much.
[11:41:48] <wikibugs>	 (03CR) 10Krinkle: Use GTIDs for master position queries for external DB when possible (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz)
[11:42:18] <Urbanecm>	 !log EU SWAT done
[11:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:26] <marostegui>	 Amir1: would you expect this change on queries? https://grafana.wikimedia.org/d/000000273/mysql?panelId=31&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&refresh=10s&from=now-3h&to=now
[11:43:20] <Amir1>	 I don't think so
[11:43:23] <Amir1>	 I can double check
[11:44:01] <marostegui>	 I think we are having contention now https://grafana.wikimedia.org/d/000000273/mysql?panelId=23&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&refresh=10s&from=now-3h&to=now
[11:44:59] <marostegui>	 The read_key handler is now decreasing, let's give it some more time, it might be getting everything in memory
[11:45:59] <marostegui>	 Amir1: the traffic has basically shifted back to previous values
[11:46:30] <marostegui>	 Amir1: Is everything ok, queries are going down
[11:46:32] <marostegui>	 ?
[11:46:35] <wikibugs>	 (03PS6) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246)
[11:47:00] <Amir1>	 yes
[11:47:03] <Amir1>	 it looks good
[11:47:19] <marostegui>	 ok, so yesterday's theory about the buffer pool seems correct
[11:47:24] <Amir1>	 It was probably putting things into the program cache
[11:47:34] <Amir1>	 *application cache
[11:47:49] <marostegui>	 key handler also decreasing
[11:47:54] <wikibugs>	 (03PS1) 10Jbond: labpuppetmaster: add missing hiera key [puppet] - 10https://gerrit.wikimedia.org/r/527085 (https://phabricator.wikimedia.org/T229571)
[11:48:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] labpuppetmaster: add missing hiera key [puppet] - 10https://gerrit.wikimedia.org/r/527085 (https://phabricator.wikimedia.org/T229571) (owner: 10Jbond)
[11:50:42] <wikibugs>	 (03CR) 10Urbanecm: "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg)
[11:53:20] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527086 (https://phabricator.wikimedia.org/T228657)
[11:53:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527086 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond)
[11:53:58] <icinga-wm>	 RECOVERY - Disk space on elastic1017 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1017&var-datasource=eqiad+prometheus/ops
[11:54:59] <marostegui>	 Amir1: I think we are kinda back to previous values now
[11:55:08] <marostegui>	 read_key handler still higher, but that's not bad
[11:55:36] <Amir1>	 coooooooooooooooool
[11:55:42] <marostegui>	 we did have some errors: https://grafana.wikimedia.org/d/000000273/mysql?panelId=10&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104&refresh=10s&from=now-3h&to=now
[11:55:55] <Amir1>	 Now we can move to items and finally kill wb_terms table
[11:56:02] <marostegui>	 <3
[11:56:16] <Amir1>	 it's fine I guess
[11:56:32] <marostegui>	 yeah, everything else looks similar to previous patterns
[11:56:38] <marostegui>	 and the query latency hasn't changed
[11:58:09] <wikibugs>	 (03CR) 10Urbanecm: Fix AddGroups/RemoveGroups for editor/autoreview (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518759 (https://phabricator.wikimedia.org/T226410) (owner: 10Reedy)
[12:03:37] <wikibugs>	 (03PS1) 10Ladsgroup: Switch property terms migration to WRITE_NEW on client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527087 (https://phabricator.wikimedia.org/T225053)
[12:12:20] <wikibugs>	 10Operations, 10netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10CDanis) That was scheduled maintenance in Centurylink's ticket 16820717, should be resolved as of about two hours ago.
[12:18:58] <marostegui>	 !log Rename math table on db1089 (enwiki) - T196055
[12:19:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:06] <stashbot>	 T196055: Remove table `math` from the database - https://phabricator.wikimedia.org/T196055
[12:30:50] <wikibugs>	 (03CR) 10CDanis: "> Patch Set 1:" (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/526774 (owner: 10CDanis)
[12:31:25] <wikibugs>	 (03PS2) 10CDanis: dbctl: require commit messages [software/conftool] - 10https://gerrit.wikimedia.org/r/526774
[12:36:39] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527092 (https://phabricator.wikimedia.org/T228657)
[12:38:13] <jbond42>	 !log add cp1008 to canary hosts https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/puppetmaster/frontend.yaml#L22
[12:38:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527092 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond)
[12:42:02] <wikibugs>	 (03PS2) 10Filippo Giunchedi: site: lithium to spare [puppet] - 10https://gerrit.wikimedia.org/r/526980 (https://phabricator.wikimedia.org/T229557)
[12:43:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] site: lithium to spare [puppet] - 10https://gerrit.wikimedia.org/r/526980 (https://phabricator.wikimedia.org/T229557) (owner: 10Filippo Giunchedi)
[12:50:26] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[12:50:32] <icinga-wm>	 RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:51:18] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527095 (https://phabricator.wikimedia.org/T228657)
[12:52:07] <godog>	 lithium tls failure is expected, being decom'd
[12:52:45] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527095 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond)
[12:57:33] <wikibugs>	 (03PS1) 10Jbond: puppetm,aster - canary_hosts: allow hosts top be IP addresses as well as fqdn [puppet] - 10https://gerrit.wikimedia.org/r/527096
[12:58:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetm,aster - canary_hosts: allow hosts top be IP addresses as well as fqdn [puppet] - 10https://gerrit.wikimedia.org/r/527096 (owner: 10Jbond)
[13:12:05] <wikibugs>	 (03PS1) 10Jbond: puppetmaster::frontend: allow canary hosts to be IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/527097
[13:12:48] <icinga-wm>	 PROBLEM - puppet last run on alcyone is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:13:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster::frontend: allow canary hosts to be IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/527097 (owner: 10Jbond)
[13:20:42] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (10Papaul)
[13:21:03] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (10Papaul) 05Open→03Resolved Complete
[13:28:45] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527100 (https://phabricator.wikimedia.org/T228657)
[13:29:46] <wikibugs>	 (03PS2) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527100 (https://phabricator.wikimedia.org/T228657)
[13:30:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527100 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond)
[13:32:48] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10Papaul) a:05Papaul→03wiki_willy @jijiki I will be talking to @wiki_willy to see what our options are on this.  @wiki_willy this system has been out of warranty since April 2019 and we do have a proble...
[13:35:54] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Papaul) I checked IDRAC logs this morning, all looks good so far
[13:37:20] <wikibugs>	 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 2 others: Write incident report for jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Pchelolo) 05Open→03Resolved Report written. Please reopen if it's not sufficient.
[13:44:49] <icinga-wm>	 RECOVERY - puppet last run on alcyone is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:48:45] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Marostegui) Hehe, yeah, I checked too. Let's give it till Monday. Cross your fingers!
[13:57:24] <wikibugs>	 (03PS1) 10CDanis: dbctl: to 100%! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527104 (https://phabricator.wikimedia.org/T229070)
[13:59:59] <wikibugs>	 (03PS2) 10CDanis: dbctl: to 100%! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527104 (https://phabricator.wikimedia.org/T229070)
[14:00:04] <jouncebot>	 cdanis: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) dbctl to 100% deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1400).
[14:03:58] <wikibugs>	 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Marostegui) I have set up the proxy for m2 in codfw. I kn...
[14:08:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] dbctl: to 100%! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527104 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis)
[14:09:33] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] check_anycast_healthchecker, add sudo bird rights [puppet] - 10https://gerrit.wikimedia.org/r/526791 (owner: 10Ayounsi)
[14:12:56] <wikibugs>	 (03PS2) 10Ayounsi: check_anycast_healthchecker, add sudo bird rights [puppet] - 10https://gerrit.wikimedia.org/r/526791
[14:14:12] * cdanis taking over mwdebug2002 for a quick test
[14:17:18] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527109 (https://phabricator.wikimedia.org/T228657)
[14:17:54] * cdanis proceeding with rollout
[14:18:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527109 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond)
[14:18:27] <wikibugs>	 (03PS2) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527109 (https://phabricator.wikimedia.org/T228657)
[14:18:30] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] dbctl: to 100%! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527104 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis)
[14:19:50] <wikibugs>	 (03Merged) 10jenkins-bot: dbctl: to 100%! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527104 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis)
[14:20:40] <wikibugs>	 (03CR) 10jenkins-bot: dbctl: to 100%! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527104 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis)
[14:22:03] <logmsgbot>	 !log cdanis@deploy1001 Synchronized wmf-config/etcd.php: Iaaa1238 dbctl to 100% of production! (duration: 00m 54s)
[14:22:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:38] <Amir1>	 I'm going to deploy something for a train blocker
[14:22:53] <cdanis>	 Amir1: ok, my deploy is done
[14:23:19] <Amir1>	 cdanis: thanks and congrats for doing this. It's awesome. I love it
[14:23:39] <cdanis>	 😊
[14:23:59] <wikibugs>	 (03PS7) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246)
[14:24:01] <apergos>	 I ran a test dump of a few revisions on a snapshot host. works fine. :-)
[14:24:23] <cdanis>	 apergos: well, the old db-foo.php configs are still correct, for the moment
[14:24:30] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Point m2-master.codfw to dbproxy2002 [dns] - 10https://gerrit.wikimedia.org/r/527114 (https://phabricator.wikimedia.org/T176532)
[14:24:34] <elukey>	 cdanis: dbctl 100% ??
[14:24:37] <apergos>	 yes but we don't use those directly for anything
[14:24:38] <cdanis>	 elukey: 100%
[14:24:42] <apergos>	 it's all 'ask mediawiki'
[14:24:45] <mdholloway>	 Amir1: thanks for deploying that. i was just waiting for cdanis to finish, but i'm happy to let you do the honors :)
[14:24:48] <cdanis>	 apergos: that should be fine
[14:24:55] <apergos>	 exactly!
[14:25:06] <elukey>	 \o/
[14:25:14] <cdanis>	 elukey: now I have some time for ONFIRE things ;)
[14:25:22] <cdanis>	 (and thanks for doing what you did!)
[14:25:41] <Amir1>	 mdholloway: I can do something else if you're on it
[14:25:52] * Amir1 looks at his plate full of bugs
[14:25:58] <elukey>	 thank you for this work, I can see marostegui partying now
[14:26:15] <mdholloway>	 Amir1: yeah, i can take it from here.
[14:26:22] <volans>	 and we're live! :D
[14:26:28] <Amir1>	 thanks then mdholloway 
[14:26:36] <wikibugs>	 (03PS2) 10Marostegui: wmnet: Point m2-master.codfw to dbproxy2002 [dns] - 10https://gerrit.wikimedia.org/r/527114 (https://phabricator.wikimedia.org/T176532)
[14:27:50] <cdanis>	 I just realized I've forgotten to sync-file on CommonSettings.php, but my only changes there were to comments, so I won't worry about it for now: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/527104/2/wmf-config/CommonSettings.php
[14:27:57] <Amir1>	 cdanis: you have no idea how useful it is for devs as well. Schema changes always took a really long time. I basically stopped doing any schema change development because of the schema change process being so slow and hard for our DBAs and T191231
[14:27:58] <stashbot>	 T191231: RFC: Abstract schemas and schema changes - https://phabricator.wikimedia.org/T191231
[14:28:33] <volans>	 cdanis: do that too anyway to not leave changed files un-deployed
[14:28:34] <cdanis>	 I'm glad to hear it :) also want to thank _joe_ and volans as well, could not have done it without either of them
[14:28:41] <Amir1>	 I'm pretty sure it's ten times harder for our DBAs
[14:30:38] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] "Awesome! Thank you! This will allow gerrit2001 to start!" [dns] - 10https://gerrit.wikimedia.org/r/527114 (https://phabricator.wikimedia.org/T176532) (owner: 10Marostegui)
[14:30:53] <Amir1>	 marostegui: btw. I will roll this change out for client wikis on Monday
[14:31:03] <wikibugs>	 (03CR) 10Alaa Sarhan: [C: 03+1] Switch property terms migration to WRITE_NEW on client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527087 (https://phabricator.wikimedia.org/T225053) (owner: 10Ladsgroup)
[14:31:16] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] Remove RESTBase graphite alerts. [puppet] - 10https://gerrit.wikimedia.org/r/525856 (https://phabricator.wikimedia.org/T185089) (owner: 10Ppchelko)
[14:31:19] <marostegui>	 Amir1: Not sure I get what that means .)
[14:31:31] <Amir1>	 the reads will shift again but given that caches are already warmed up, I don't think it'll cause any noticeable difference
[14:32:12] <marostegui>	 Amir1: What do you mean with wiki clients? 
[14:32:17] <Amir1>	 marostegui: client wikis (=all wikis that read from wikidata) also access the term store
[14:32:23] <marostegui>	 Ah right, ok
[14:32:24] <marostegui>	 Sure
[14:32:38] <Amir1>	 it's probably half of the reads 
[14:33:01] <marostegui>	 You'll deploy that at the normal SWAT time?
[14:33:25] <Amir1>	 marostegui: yup
[14:33:33] <marostegui>	 Amir1: cool, I'll be around
[14:33:39] <Amir1>	 thanks
[14:34:20] <Lucas_WMDE>	 uh, shouldn’t they already access the new term store?
[14:34:34] <Lucas_WMDE>	 I thought the latest repo change was that the old store is no longer written to?
[14:35:06] <Amir1>	 Lucas_WMDE: what do you mean? Can you elaborate more?
[14:36:01] <Lucas_WMDE>	 I thought the change you made in the repo was from “read new write both” to “read new”
[14:36:16] <Lucas_WMDE>	 is that wrong? was it only changing from “read old write both” to “read new write both”?
[14:36:46] <Amir1>	 The change was “read old write both” to “read new write both”
[14:36:53] <Lucas_WMDE>	 ok good
[14:36:54] <Amir1>	 on wikidata, client still reads old
[14:36:56] <cdanis>	 Amir1: have you deployed yet?  is it okay if I do another quick comment-only sync-file?
[14:36:59] <Lucas_WMDE>	 then the clients can still read either
[14:37:08] <Amir1>	 cdanis: mdholloway is doing it
[14:37:28] <Amir1>	 Lucas_WMDE: yes
[14:37:34] <mdholloway>	 cdanis: go for it, i'll be waiting on jenkins for a bit
[14:37:37] <Amir1>	 we just didn't deploy it yet
[14:37:42] <cdanis>	 rgr, ty
[14:38:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust)
[14:38:39] <logmsgbot>	 !log cdanis@deploy1001 Synchronized wmf-config/CommonSettings.php: Iaaa1238 comment-only no-op change (dbctl to 100% of production!) (duration: 00m 55s)
[14:38:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:17] <icinga-wm>	 RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:39:34] <wikibugs>	 10Operations, 10ops-codfw: ms-be2018 sdc unreadable sector - https://phabricator.wikimedia.org/T225630 (10fgiunchedi) 05Open→03Resolved Disk replaced and is rebuilding, thanks @Papaul
[14:40:53] <brennen>	 mdholloway / Amir1 - thanks again for that.
[14:41:03] <wikibugs>	 10Operations, 10Traffic: fifo-log-tailer: evergrowing memory usage - https://phabricator.wikimedia.org/T229414 (10ema) 05Open→03Resolved The new `fifo-log-tailer` has now been running for one day and shows reasonable memory usage:  ` 14:39:53 ema@cp1080.eqiad.wmnet:~ $ ps u -q `pidof fifo-log-tailer`  USER...
[14:41:22] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "Looks like a workable stopgap.  We could probably run it on a 5/10 min interval to recover more quickly if/when rsyslog fails." [puppet] - 10https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi)
[14:41:31] <mdholloway>	 brennen: np!
[14:42:01] <Amir1>	 brennen: I actually caused the issue so I don't think I should be thanked. Sorry for the trouble
[14:42:12] <mdholloway>	 Amir1: btw, where's the best place to chat with WMDE folks? i joined #wikidata a few days ago, but it looked pretty desolate, so i left
[14:42:22] <wikibugs>	 (03PS8) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246)
[14:42:53] <mdholloway>	 i guess this channel works, in any case
[14:43:21] <Lucas_WMDE>	 #wikimedia-de-tech is the usual channel
[14:43:47] <mdholloway>	 ah, thanks
[14:44:38] <wikibugs>	 (03CR) 10Ema: "> We are in the process of splitting RESTBase into two services in" [puppet] - 10https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[14:49:16] <wikibugs>	 (03CR) 10AndyRussG: [C: 04-1] "Hi! Mediawiki config changes should not be +2'd until the time of the deploy window when they'll be deployed. See: https://wikitech.wikime" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg)
[14:50:42] <wikibugs>	 (03CR) 10AndyRussG: [C: 04-1] "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg)
[14:50:48] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T229283 (10fgiunchedi) 05Open→03Resolved Disk replaced and rebuilding
[14:50:51] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on ms-be2021 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2021&var-datasource=codfw+prometheus/ops
[14:50:58] <brennen>	 Lucas_WMDE: that appears to be the IRC channel for today that i didn't know existed but should have.  i seem to pick up at least one a week.
[14:51:17] <herron>	 !log performing rolling restarts of eqiad logstash cluster for security updates
[14:51:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:47] <icinga-wm>	 RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:53:19] <icinga-wm>	 RECOVERY - HP RAID on ms-be2021 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:54:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust)
[14:57:29] <wikibugs>	 (03PS1) 10BPirkle: Switch testwiki to read sessions from kask, with fallback to redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527120 (https://phabricator.wikimedia.org/T222099)
[15:00:56] <wikibugs>	 (03CR) 10AndyRussG: [C: 04-1] "Any thoughts on T225261? Maybe we could at least partly bring this inline with site-wide policy? Also, when site-wide CSP becomes enforced" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg)
[15:02:01] <icinga-wm>	 PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:03:15] <icinga-wm>	 PROBLEM - SSH on ms-be2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:03:57] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on ms-be2018 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2018&var-datasource=codfw+prometheus/ops
[15:04:33] <icinga-wm>	 RECOVERY - SSH on ms-be2021 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:05:56] <mdholloway>	 OK, looks like nothing's on for Puppet SWAT
[15:06:51] <mdholloway>	 herron: should i wait for you to finish before deploying the backports for the train blocker fix, or can that happen in parallel? (or are you already done?)
[15:07:19] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 104.65, 120.56, 74.82 https://wikitech.wikimedia.org/wiki/Swift
[15:07:29] <icinga-wm>	 RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:07:36] <herron>	 mdholloway: no, please carry on.  it’s a slow online rolling restart with no impact expected
[15:07:44] <mdholloway>	 cool, thanks
[15:11:17] <marostegui>	 Amir1: did something happen at around 14:00 UTC? there was a spike, similar to the one when you deployed the change
[15:11:26] <marostegui>	 https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1109&var-port=9104&from=now-6h&to=now&refresh=5s
[15:11:55] <Amir1>	 marostegui: I don't remember deploying anything
[15:11:57] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be2018 is OK: OK - load average: 43.55, 73.68, 66.54 https://wikitech.wikimedia.org/wiki/Swift
[15:12:29] <marostegui>	 Amir1: any possible explanation for that?
[15:13:04] <logmsgbot>	 !log mholloway-shell@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/Wikibase: Do not warn about entity that was not found in WikiPageEntityRevisionLookup (T229482) (duration: 01m 20s)
[15:13:06] <cdanis>	 marostegui: that's too early for the dbctl 100%
[15:13:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:12] <stashbot>	 T229482: PHP Warning: Wikibase\Lib\Store\Sql\WikiPageEntityRevisionLookup::getEntityRevision: Entity not loaded - https://phabricator.wikimedia.org/T229482
[15:13:18] <marostegui>	 cdanis: yeah, it started at around 13:58 or so
[15:13:21] <Amir1>	 marostegui: did you check SAL?
[15:13:25] <marostegui>	 Amir1: yep
[15:14:11] <Amir1>	 I don't think I did anything. :/
[15:14:32] <marostegui>	 Amir1: Then it is very weird, as the spike is very similar (also in duration) to the one we saw with the first deploy
[15:14:54] <marostegui>	 Amir1: Same type of select even :-/
[15:15:20] <Amir1>	 it can be that some caches got evicted, especially if it happens again
[15:15:28] <Amir1>	 then we need to do something about it
[15:15:45] <marostegui>	 Amir1: Yeah, definitely, those spikes aren't good if they happen that often, let's keep an eye on it
[15:15:53] <Amir1>	 sure
[15:15:57] <marostegui>	 I am pretty sure it is related, because it is exactly the same pattern on pretty much every graph: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1109&var-port=9104&from=now-6h&to=now&refresh=5s
[15:16:26] <logmsgbot>	 !log mholloway-shell@deploy1001 Synchronized php-1.34.0-wmf.15/extensions/Wikibase: Do not warn about entity that was not found in WikiPageEntityRevisionLookup (T229482) (duration: 01m 14s)
[15:16:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:08] <wikibugs>	 (03CR) 10Ema: [C: 03+1] fifo_log_demux: Remove socket activation [puppet] - 10https://gerrit.wikimedia.org/r/527075 (owner: 10Vgutierrez)
[15:20:17] <brennen>	 mdholloway: i should be clear at this point to proceed with wmf.16 -> group1, yeah?
[15:21:20] <mdholloway>	 brennen: yep, should be clear
[15:21:30] <brennen>	 cool, thanks.
[15:22:12] <wikibugs>	 (03PS4) 10Mforns: analytics::refinery::job::data_purge Migrate webrequest timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862)
[15:23:42] <wikibugs>	 (03CR) 10Ema: fifo-log-demux: Keep attempting to read the FIFO after EOF (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/527013 (owner: 10Vgutierrez)
[15:24:21] <wikibugs>	 (03PS5) 10Mforns: analytics::refinery::job::data_purge Migrate webrequest timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862)
[15:25:07] <wikibugs>	 (03PS9) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246)
[15:27:10] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-codfw.wikimedia.org recovered from Juniper alarm active
[15:30:21] <wikibugs>	 (03PS1) 10Brennen Bearnes: group1 wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527128
[15:30:23] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527128 (owner: 10Brennen Bearnes)
[15:31:34] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10BBlack) 05Open→03Declined We had a quick discussion and a small informal vote and decided we don't really need this functionality (pinkunicorn) anymore, so we're going to retire it...
[15:31:37] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10BBlack)
[15:32:17] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527129 (https://phabricator.wikimedia.org/T228657)
[15:34:28] <wikibugs>	 (03CR) 10Mobrovac: "> Do you have a rough estimate of when "a bit down the line" could" [puppet] - 10https://gerrit.wikimedia.org/r/523928 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[15:34:40] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10Jclark-ctr)
[15:35:21] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+1] Switch testwiki to read sessions from kask, with fallback to redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527120 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle)
[15:36:26] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527128 (owner: 10Brennen Bearnes)
[15:39:05] <logmsgbot>	 !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.16
[15:39:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:00] <logmsgbot>	 !log brennen@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.16 (duration: 00m 54s)
[15:40:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527129 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond)
[15:40:20] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 92.86% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[15:41:38] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: restrouter: Switch to event_service_uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/527130 (https://phabricator.wikimedia.org/T223953)
[15:44:42] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[15:45:34] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on ms-be2018 is CRITICAL: cluster=swift device=cciss,2 instance=ms-be2018:9100 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2018&var-datasource=codfw+prometheus/ops
[15:45:46] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] Switch testwiki to read sessions from kask, with fallback to redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527120 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle)
[15:47:52] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[15:47:55] <XioNoX>	 !log start codfw mgmt work - T228112
[15:48:02] <wikibugs>	 (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527128 (owner: 10Brennen Bearnes)
[15:48:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:03] <stashbot>	 T228112: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112
[15:48:40] <wikibugs>	 (03PS4) 10Filippo Giunchedi: facilities: add model to pdu monitoring [puppet] - 10https://gerrit.wikimedia.org/r/526633 (https://phabricator.wikimedia.org/T148541)
[15:50:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: restrouter: Add helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/526719 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris)
[15:51:41] <wikibugs>	 (03PS2) 10Mforns: analytics::refinery::job::data_purge Migrate EL timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519685 (https://phabricator.wikimedia.org/T226862)
[15:51:44] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:52:08] <wikibugs>	 10Operations, 10Analytics, 10Discovery, 10Research-Backlog: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10EBernhardson)
[15:52:21] <wikibugs>	 (03CR) 10Ppchelko: [V: 03+2 C: 03+2] restrouter: Switch to event_service_uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/527130 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris)
[15:53:24] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:54:22] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[15:55:52] <wikibugs>	 10Operations, 10DBA, 10serviceops-radar, 10Performance-Team (Radar): phased rollout of dbctl, etcd-backed database configuration in Mediawiki - https://phabricator.wikimedia.org/T229070 (10Krinkle) >>! In T229070#5367389, @gerritbot wrote: > Change 525684 had a related patch set uploaded (by CDanis; owner:...
[15:55:57] <wikibugs>	 10Operations, 10Traffic, 10netops, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10jbond) Hi all,  I just stumbled upon this task while investigating something else.   Its something I'm happy to progress however i wanted to consider if the...
[15:57:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] facilities: add model to pdu monitoring [puppet] - 10https://gerrit.wikimedia.org/r/526633 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[15:57:38] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[15:58:10] <librenms-wmf>	 04Critical Alert for device cr1-codfw.wikimedia.org - Juniper alarm active
[15:59:18] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:59:22] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 126, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:59:42] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[16:00:04] <jouncebot>	 godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1600).
[16:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:00:06] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10BBlack)
[16:01:28] <wikibugs>	 (03PS1) 10BBlack: acamar/achernar: site.pp cleanup post-decom [puppet] - 10https://gerrit.wikimedia.org/r/527133 (https://phabricator.wikimedia.org/T198286)
[16:01:29] <wikibugs>	 (03PS1) 10BBlack: eqiad cp decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/527134 (https://phabricator.wikimedia.org/T229586)
[16:02:24] <wikibugs>	 (03CR) 10BBlack: [V: 03+2 C: 03+2] acamar/achernar: site.pp cleanup post-decom [puppet] - 10https://gerrit.wikimedia.org/r/527133 (https://phabricator.wikimedia.org/T198286) (owner: 10BBlack)
[16:03:11] <wikibugs>	 (03PS1) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527135 (https://phabricator.wikimedia.org/T228657)
[16:04:26] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10wiki_willy) @Papaul - if you can't find a spare from any of those decom servers, we can order it, since it's still a while before the 5yr mark.  Thanks Willy
[16:05:46] <XioNoX>	 !log power down msw1-codfw
[16:05:48] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[16:05:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:04] <wikibugs>	 (03PS2) 10Jbond: puppetmaster: add canary hosts to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/527135 (https://phabricator.wikimedia.org/T228657)
[16:09:02] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[16:09:29] <wikibugs>	 (03CR) 10Urbanecm: "> Patch Set 2: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg)
[16:10:31] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10kzimmerman) a:05Nuria→03None I've approved as manager, so moving back to unassigned for...
[16:11:11] <wikibugs>	 (03PS1) 10Krinkle: noc: Add cross-dc navigation links to db.php footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527137
[16:11:33] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] "simple patch CI taking too long" [puppet] - 10https://gerrit.wikimedia.org/r/527135 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond)
[16:12:13] <wikibugs>	 (03PS2) 10Krinkle: noc: Add cross-dc navigation links to db.php footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527137 (https://phabricator.wikimedia.org/T197126)
[16:14:49] <wikibugs>	 (03PS2) 10Cwhite: logstash: update statsd exporter mappings and use exporter [puppet] - 10https://gerrit.wikimedia.org/r/526782 (https://phabricator.wikimedia.org/T205870)
[16:18:58] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[16:22:12] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[16:24:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/526782 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite)
[16:25:34] <wikibugs>	 (03PS2) 10BBlack: eqiad cp decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/527134 (https://phabricator.wikimedia.org/T229586)
[16:32:51] <mdholloway>	 oh, i'm surprised you all could actually hear me in the meeting, i just realized i've got the wrong (mic-less) headphones on
[16:33:48] <mdholloway>	 i guess someone would have said something if it was an issue
[16:35:10] <wikibugs>	 (03CR) 10Krinkle: Bring up password change logging to the same standards as login logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467110 (owner: 10Gergő Tisza)
[16:37:26] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] eqiad cp decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/527134 (https://phabricator.wikimedia.org/T229586) (owner: 10BBlack)
[16:40:35] <mdholloway>	 sheesh, wrong channel
[16:40:49] <wikibugs>	 (03PS1) 10BBlack: pink unicorn death [dns] - 10https://gerrit.wikimedia.org/r/527145 (https://phabricator.wikimedia.org/T229586)
[16:44:32] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp1008.wikimedia.o...
[16:45:22] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] pink unicorn death [dns] - 10https://gerrit.wikimedia.org/r/527145 (https://phabricator.wikimedia.org/T229586) (owner: 10BBlack)
[16:45:43] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10Jclark-ctr)
[16:46:39] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10Jclark-ctr)
[16:51:10] <wikibugs>	 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10greg) a:05greg→03mmodell >>! In T226044#5380942, @greg wrote: >>>! In T226044#5380759...
[16:54:14] <icinga-wm>	 PROBLEM - puppet last run on aqs1008 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:57:29] <wikibugs>	 10Operations, 10Puppet, 10Packaging: puppet fails to run in cp1008 under certain conditions - https://phabricator.wikimedia.org/T221343 (10BBlack) 05Open→03Declined Decom in T229586
[16:57:37] <wikibugs>	 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10BBlack)
[16:58:06] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on ms-be2018 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2018&var-datasource=codfw+prometheus/ops
[17:00:05] <jouncebot>	 cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1700).
[17:01:00] <wikibugs>	 10Operations, 10Puppet, 10observability: Use git commit id as "configuration version" for puppet - https://phabricator.wikimedia.org/T228854 (10jbond) I think this is a really good idea.  further after a  bit of investigation i think any arbitrary string can be used.  Later versions of the puppet documentati...
[17:02:10] <librenms-wmf>	 04Critical Alert for device msw1-codfw.mgmt.codfw.wmnet - Juniper alarm active
[17:09:36] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 62.01, 30.65, 21.36 https://wikitech.wikimedia.org/wiki/Application_servers
[17:09:58] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1073.eqiad.wmnet', 'cp1074.eqiad.wmnet', 'cp1072.eqiad.wmnet', 'cp...
[17:10:56] <wikibugs>	 (03Abandoned) 10BBlack: ncredir hostname and service IP [dns] - 10https://gerrit.wikimedia.org/r/295249 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack)
[17:11:52] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X
[17:11:55] <wikibugs>	 (03Abandoned) 10BBlack: redirects.dat - split non-canonical to separate section [puppet] - 10https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: 10BBlack)
[17:12:12] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1286 is OK: OK - load average: 23.10, 28.77, 22.31 https://wikitech.wikimedia.org/wiki/Application_servers
[17:12:25] <wikibugs>	 (03Abandoned) 10BBlack: [POC] DNS zones to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/342887 (owner: 10BBlack)
[17:12:27] <wikibugs>	 (03PS3) 10Elukey: analytics::refinery::job::data_purge Migrate EL timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519685 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns)
[17:12:29] <wikibugs>	 (03Abandoned) 10BBlack: VCL: grace-within-TTL [puppet] - 10https://gerrit.wikimedia.org/r/364606 (owner: 10BBlack)
[17:12:31] <wikibugs>	 (03CR) 10Gergő Tisza: Bring up password change logging to the same standards as login logging (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467110 (owner: 10Gergő Tisza)
[17:12:35] <wikibugs>	 (03CR) 10Dzahn: "boldly removing Giuseppe's -2 because we talked about it in meeting and agreed it's good to go now" [puppet] - 10https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069) (owner: 10Dzahn)
[17:13:12] <wikibugs>	 (03Abandoned) 10BBlack: Browser connection security warnings, again [puppet] - 10https://gerrit.wikimedia.org/r/407701 (owner: 10BBlack)
[17:13:55] <wikibugs>	 (03Abandoned) 10BBlack: [WIP] Move cache::canary from cp1008 to cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/451326 (owner: 10BBlack)
[17:14:36] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X
[17:14:39] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10BBlack) These are ready to go for dcops-level work!
[17:14:42] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:17:10] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device msw1-codfw.mgmt.codfw.wmnet recovered from Juniper alarm active
[17:19:24] <wikibugs>	 (03PS1) 10BBlack: Remove cache::canary stuff [puppet] - 10https://gerrit.wikimedia.org/r/527157
[17:20:08] <icinga-wm>	 RECOVERY - puppet last run on aqs1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:21:28] <wikibugs>	 (03Abandoned) 10BBlack: XXX note bad entries for conf200x in network::constants [puppet] - 10https://gerrit.wikimedia.org/r/465455 (owner: 10BBlack)
[17:23:34] <wikibugs>	 (03Abandoned) 10BBlack: CI check [dns] - 10https://gerrit.wikimedia.org/r/483198 (owner: 10BBlack)
[17:24:30] <wikibugs>	 (03Abandoned) 10BBlack: discovery-map remove [1/4]: remove refs [puppet] - 10https://gerrit.wikimedia.org/r/522110 (owner: 10BBlack)
[17:24:40] <wikibugs>	 (03Abandoned) 10BBlack: discovery-map remove [2/4]: remove ops/dns refs [dns] - 10https://gerrit.wikimedia.org/r/522113 (owner: 10BBlack)
[17:24:51] <wikibugs>	 (03Abandoned) 10BBlack: discovery-map remove [3/4]: stop deploying [puppet] - 10https://gerrit.wikimedia.org/r/522111 (owner: 10BBlack)
[17:24:59] <wikibugs>	 (03Abandoned) 10BBlack: discovery-map remove [4/4]: Remove completely [puppet] - 10https://gerrit.wikimedia.org/r/522112 (owner: 10BBlack)
[17:25:32] <icinga-wm>	 PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:26:29] <wikibugs>	 (03Abandoned) 10BBlack: Block POSTs to some wiki URLs [puppet] - 10https://gerrit.wikimedia.org/r/240389 (owner: 10Coren)
[17:28:11] <librenms-wmf>	 04Critical Alert for device cr1-codfw.wikimedia.org - Juniper alarm active got acknowledged
[17:28:39] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::data_purge Migrate EL timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519685 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns)
[17:28:46] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (10Mayakp.wiki) Thanks @Nuria for the query and suggestion. I will use Jupyter and Beeline in the meantime. Please let me know whenever my HUE access is granted. https://wikitech.wikimed...
[17:29:19] <wikibugs>	 (03CR) 10BBlack: "Compiler looks sane: https://puppet-compiler.wmflabs.org/compiler1001/17697/" [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson)
[17:29:32] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:29:32] <wikibugs>	 (03PS10) 10BBlack: LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson)
[17:29:41] <wikibugs>	 (03PS6) 10Elukey: analytics::refinery::job::data_purge Migrate webrequest timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns)
[17:31:29] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::data_purge Migrate webrequest timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns)
[17:32:16] <icinga-wm>	 RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:36:08] <twentyafterfour>	 !log running db dump on phab1003 (in tmux).  command: sudo ./bin/storage dump --output /srv/dumps/phabricator_db_20190801.sql.gz --compress
[17:36:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:10] <apergos>	 I'm almost curious enough about what that does to go look at the code... but not quite
[17:41:15] <apergos>	 (going on 9 pm)
[17:42:44] <bblack>	 !log disable puppet on lvs1014 + lvs1016 for cloudelastic LVS merge - T224324
[17:42:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:53] <stashbot>	 T224324: LB for cloudelastic - https://phabricator.wikimedia.org/T224324
[17:43:24] <wikibugs>	 (03PS11) 10BBlack: LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson)
[17:44:29] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson)
[17:49:01] <wikibugs>	 (03PS1) 10BBlack: cloudelastic hieradata: fix parens mismatch typo [puppet] - 10https://gerrit.wikimedia.org/r/527165
[17:49:19] <wikibugs>	 (03CR) 10BBlack: [V: 03+2 C: 03+2] cloudelastic hieradata: fix parens mismatch typo [puppet] - 10https://gerrit.wikimedia.org/r/527165 (owner: 10BBlack)
[17:50:08] <icinga-wm>	 PROBLEM - Check systemd state on cloudelastic1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:50:12] <icinga-wm>	 PROBLEM - Check systemd state on cloudelastic1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:50:22] <icinga-wm>	 PROBLEM - Check systemd state on cloudelastic1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:51:46] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:51:46] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:51:47] <bblack>	 ^ me, working on it
[17:51:58] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:56:00] <icinga-wm>	 PROBLEM - puppet last run on mw1270 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:58:51] <wikibugs>	 (03PS3) 10Ejegg: CSP for banner preview: allow remind me later host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T225261)
[18:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1800).
[18:00:04] <jouncebot>	 bpirkle: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:13] <wikibugs>	 (03CR) 10Ejegg: "Thanks AndyRussG! I've invoked the powers of ctrl-c ctrl-v to bring this preview CSP more in line with the existing CSP as you suggest." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T225261) (owner: 10Ejegg)
[18:00:14] <bpirkle>	 I'm here
[18:00:15] <Urbanecm>	 I can SWAT today!
[18:00:51] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Switch testwiki to read sessions from kask, with fallback to redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527120 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle)
[18:02:15] <wikibugs>	 (03Merged) 10jenkins-bot: Switch testwiki to read sessions from kask, with fallback to redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527120 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle)
[18:02:17] <wikibugs>	 (03PS1) 10Volans: cumin: remove old scripts converted to cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/527170 (https://phabricator.wikimedia.org/T205886)
[18:02:24] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[18:02:31] <wikibugs>	 (03CR) 10jenkins-bot: Switch testwiki to read sessions from kask, with fallback to redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527120 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle)
[18:02:52] <Urbanecm>	 bpirkle, pulled onto mwdebug1002, if it's testable there
[18:03:14] <icinga-wm>	 PROBLEM - SSH on ms-be2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:04:03] <bpirkle>	 Good to go
[18:04:41] <Urbanecm>	 syncing bpirkle 
[18:04:44] <icinga-wm>	 RECOVERY - SSH on ms-be2021 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:05:04] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[18:05:44] <icinga-wm>	 PROBLEM - HHVM rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[18:06:10] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: 469c42d: Switch testwiki to read sessions from kask, with fallback to redis (T222099) (duration: 00m 55s)
[18:06:14] <Urbanecm>	 bpirkle, done
[18:06:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:19] <stashbot>	 T222099: Staging release of RESTBagOStuff using Kask - https://phabricator.wikimedia.org/T222099
[18:06:45] <bpirkle>	 Urbanecm: thank you
[18:06:48] <Urbanecm>	 yw bpirkle 
[18:06:51] <wikibugs>	 (03PS1) 10BBlack: Attempt to fix cloudelastic LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/527172
[18:08:48] <icinga-wm>	 RECOVERY - HHVM rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 75584 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[18:09:31] <wikibugs>	 (03CR) 10Volans: "I'll manually delete those files from the cumin hosts." [puppet] - 10https://gerrit.wikimedia.org/r/527170 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans)
[18:09:56] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:11:56] <icinga-wm>	 PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:13:06] <wikibugs>	 (03PS1) 10Volans: sre.hosts.upgrade-varnish: cp1008 decom cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/527173 (https://phabricator.wikimedia.org/T229586)
[18:13:24] <icinga-wm>	 RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:19:46] <wikibugs>	 (03PS1) 10Dzahn: acme_chief: replace cp1008 with cp1099 as authorized host [puppet] - 10https://gerrit.wikimedia.org/r/527175 (https://phabricator.wikimedia.org/T229586)
[18:20:31] <wikibugs>	 (03PS2) 10BBlack: Attempt to fix cloudelastic LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/527172
[18:20:35] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[18:21:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Attempt to fix cloudelastic LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/527172 (owner: 10BBlack)
[18:23:05] <icinga-wm>	 RECOVERY - puppet last run on mw1270 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:23:21] <wikibugs>	 (03PS1) 10Dzahn: varnish wikimedia-backend.vcl: replace cp1008 with cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/527177 (https://phabricator.wikimedia.org/T229586)
[18:23:45] <wikibugs>	 (03PS3) 10BBlack: Attempt to fix cloudelastic LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/527172
[18:24:08] <wikibugs>	 (03CR) 10BBlack: "Compiler success! https://puppet-compiler.wmflabs.org/compiler1002/17710/cloudelastic1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/527172 (owner: 10BBlack)
[18:24:37] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[18:24:59] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Attempt to fix cloudelastic LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/527172 (owner: 10BBlack)
[18:25:03] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[18:27:53] <wikibugs>	 (03CR) 10Krinkle: "This diff is not as simple as I'd expect for adding the third-party domain. Perhaps these changes should be split?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T225261) (owner: 10Ejegg)
[18:29:34] <bblack>	 mutante: cp1099 is dying too, I'll look at whatever you're doing there in a sec...
[18:30:01] <bblack>	 !log lvs1016: puppet re-enabled, pybal restarted, cloudelastic deploy - T224324
[18:30:06] <mutante>	 bblack: ok, cool. no rush, i see you are busy 
[18:30:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:12] <stashbot>	 T224324: LB for cloudelastic - https://phabricator.wikimedia.org/T224324
[18:31:23] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 70 connections established with conf1004.eqiad.wmnet:4001 (min=82) https://wikitech.wikimedia.org/wiki/PyBal
[18:32:19] <logmsgbot>	 !log bblack@puppetmaster1001 conftool action : set/pooled=yes; selector: name=^cloudelastic.*
[18:32:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:40] <wikibugs>	 (03PS1) 10Dzahn: puppetmaster::frontend: remove cp1008 as a canary host [puppet] - 10https://gerrit.wikimedia.org/r/527180 (https://phabricator.wikimedia.org/T229586)
[18:32:47] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:34:44] <wikibugs>	 (03PS8) 10Dzahn: parsoid::testing: add mediawiki appserver profiles to role [puppet] - 10https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069)
[18:34:57] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.84 and port 8643: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:35:11] <apergos>	 paged
[18:35:12] <bblack>	 ^ ignore that
[18:35:16] <apergos>	 ignored
[18:35:17] <mutante>	 known
[18:35:18] <volans>	 ack, ignoring
[18:35:19] <bblack>	 new service just defined, not in use
[18:35:44] <arturo>	 paged
[18:35:48] <arturo>	 ok
[18:36:22] <icinga-wm>	 ACKNOWLEDGEMENT - LVS HTTP IPv4 on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.84 and port 8643: Connection refused CDanis bblack new service just defined, not in use https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:36:26] <cdanis>	 doing something I think we should be in the habit of doing 🙃
[18:36:29] <icinga-wm>	 ACKNOWLEDGEMENT - LVS HTTP IPv4 on cloudelastic.wikimedia.org is CRITICAL: connect to address 208.80.154.84 and port 8643: Connection refused Brandon Black issues bringing up a new service, non-critical for now! https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:36:29] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb6_8443: Servers cloudelastic1003.wikimedia.org, cloudelastic1001.wikimedia.org are marked down but pooled: cloudelasticlb6_9643: Servers cloudelastic1003.wikimedia.org, cloudelastic1001.wikimedia.org are marked down but pooled: cloudelasticlb6_9243: Servers cloudelastic1003.wikimedia.org, cloudelastic1001.wikimedia.org 
[18:36:29] <icinga-wm>	 are marked down but pooled: cloudelasticlb6_8243: Servers cloudelastic1003.wikimedia.org, cloudelastic1001.wikimedia.org are marked down but pooled: cloudelasticlb6_9443: Servers cloudelastic1003.wikimedia.org, cloudelastic1001.wikimedia.org are marked down but pooled: cloudelasticlb6_8643: Servers cloudelastic1003.wikimedia.org, cloudelastic1001.wikimedia.org are marked down but pooled Brandon Black issues bringing up a new service, non-critical
[18:36:29] <icinga-wm>	 https://wikitech.wikimedia.org/wiki/PyBal
[18:36:38] <bblack>	 all related
[18:36:50] <apergos>	 uh hu
[18:36:51] <apergos>	 h
[18:36:55] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 82 connections established with conf1004.eqiad.wmnet:4001 (min=82) https://wikitech.wikimedia.org/wiki/PyBal
[18:37:19] <mutante>	 cdanis: right! better getting a second SMS that says it's ACKed if you are not near laptop
[18:37:37] <mutante>	 (in the past the ACK did not create one but nowadays it does)
[18:38:49] <cdanis>	 mutante: I learned when I was updating https://wikitech.wikimedia.org/wiki/Incident_response for other reasons that ACKing has been policy for some time as well
[18:39:00] <cdanis>	 just something many of us never remember to do
[18:40:38] <mutante>	 cdanis: i agree very much. have been pushing for it
[18:40:43] <icinga-wm>	 PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:40:57] <wikibugs>	 (03PS4) 10Ejegg: CSP for banner preview: allow remind me later host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019)
[18:41:00] <wikibugs>	 (03PS1) 10Ejegg: Make banner-preview CSP match normal CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527183 (https://phabricator.wikimedia.org/T225261)
[18:41:15] <mutante>	 then you can also look at the "unhandled problems" in browser 
[18:41:24] <cdanis>	 I'm also looking forward to go.dog's https://gerrit.wikimedia.org/r/c/operations/puppet/+/525536 being merged
[18:41:27] <mutante>	 and the ones still there are meaningful
[18:41:29] <cdanis>	 and will add an IRC highlight word when it does
[18:42:51] <mutante>	 ah, yea, that's nice
[18:43:47] <icinga-wm>	 RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:44:35] <icinga-wm>	 PROBLEM - configured eth on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[18:44:37] <icinga-wm>	 PROBLEM - Disk space on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops
[18:44:37] <icinga-wm>	 PROBLEM - MD RAID on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[18:44:43] <icinga-wm>	 PROBLEM - Check size of conntrack table on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[18:44:57] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:03] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:45:03] <mutante>	 ^ OOM because some script.. which killed NRPE server..
[18:45:05] <icinga-wm>	 PROBLEM - dhclient process on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[18:45:09] <mutante>	 unfortunately common
[18:45:16] <mutante>	 and so noisy
[18:45:19] <icinga-wm>	 PROBLEM - DPKG on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[18:45:37] <icinga-wm>	 PROBLEM - puppet last run on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:45:57] <mutante>	 yep, that was exactly it again
[18:46:13] <icinga-wm>	 RECOVERY - configured eth on stat1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[18:46:15] <icinga-wm>	 RECOVERY - Disk space on stat1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops
[18:46:15] <icinga-wm>	 RECOVERY - MD RAID on stat1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[18:46:19] <icinga-wm>	 RECOVERY - Check size of conntrack table on stat1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[18:46:31] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:46:39] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on stat1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:46:39] <icinga-wm>	 RECOVERY - dhclient process on stat1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[18:46:52] <wikibugs>	 (03CR) 10Ejegg: "Good call Krinkle. Now split." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg)
[18:46:55] <icinga-wm>	 RECOVERY - DPKG on stat1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[18:47:01] <mutante>	 !log stat1004 - starting nagios-nrpe-server which got killed again - jbd2/md0-8 invoked oom-killer
[18:47:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
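The stat1004 burst above is one failure (the NRPE server killed by the OOM killer) reported as many separate checks, each with the same "port 5666: Connection refused" detail. A minimal sketch of how such a burst could be collapsed into a single per-host finding when reading this log follows. The function name `nrpe_down_hosts` and the `min_checks` threshold are hypothetical, and the regular expression only assumes the icinga-wm line shape visible in this channel.

```python
import re
from collections import defaultdict

# Shape of icinga-wm PROBLEM lines as they appear in this log (assumption:
# "PROBLEM - <check> on <host> is CRITICAL: <detail>").
PROBLEM_RE = re.compile(
    r"PROBLEM - (?P<check>.+?) on (?P<host>\S+) is CRITICAL: (?P<detail>.*)"
)
# Detail text shared by every check that goes through the NRPE agent.
NRPE_REFUSED = "port 5666: Connection refused"


def nrpe_down_hosts(lines, min_checks=3):
    """Return {host: [check names]} for hosts where at least `min_checks`
    checks failed because the NRPE port refused the connection."""
    refused = defaultdict(list)
    for line in lines:
        match = PROBLEM_RE.search(line)
        if match and NRPE_REFUSED in match.group("detail"):
            refused[match.group("host")].append(match.group("check"))
    return {host: checks for host, checks in refused.items()
            if len(checks) >= min_checks}
```

Feeding it the 18:44 to 18:45 lines would group them under stat1004, while an unrelated alert such as the ms-be2018 SSH timeout is ignored because its detail text differs.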
[18:47:55] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] CSP for banner preview: allow remind me later host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg)
[18:48:50] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] parsoid::testing: add mediawiki appserver profiles to role [puppet] - 10https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069) (owner: 10Dzahn)
[18:51:11] <icinga-wm>	 RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:52:16] <mutante>	 !log scandium (parsoid testing) - added mw application server roles - puppet work / maintenance
[18:52:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:04] <jouncebot>	 brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - American version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T1900).
[19:08:12] <wikibugs>	 (03PS1) 10Brennen Bearnes: all wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527192
[19:08:14] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527192 (owner: 10Brennen Bearnes)
[19:09:21] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527192 (owner: 10Brennen Bearnes)
[19:09:41] <wikibugs>	 (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527192 (owner: 10Brennen Bearnes)
[19:12:34] <logmsgbot>	 !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.16
[19:12:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:35] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:20:38] <brennen>	 !log rolling back to wmf.15 on group1 and group2 while we investigate T229575
[19:20:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:47] <stashbot>	 T229575: phabricator server 500 error - https://phabricator.wikimedia.org/T229575
[19:25:21] <icinga-wm>	 PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:25:32] <wikibugs>	 (03PS1) 10Brennen Bearnes: Group1 and Group2 to php-1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527196
[19:25:34] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Group1 and Group2 to php-1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527196 (owner: 10Brennen Bearnes)
[19:26:04] <wikibugs>	 (03PS1) 10BBlack: cloudelastic: add mapped ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/527197
[19:26:37] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] cloudelastic: add mapped ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/527197 (owner: 10BBlack)
[19:26:41] <wikibugs>	 (03Merged) 10jenkins-bot: Group1 and Group2 to php-1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527196 (owner: 10Brennen Bearnes)
[19:26:57] <wikibugs>	 (03CR) 10jenkins-bot: Group1 and Group2 to php-1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527196 (owner: 10Brennen Bearnes)
[19:30:01] <icinga-wm>	 RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:30:03] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/526774 (owner: 10CDanis)
[19:31:36] <logmsgbot>	 !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 and group2 to 1.34.0-wmf.15
[19:31:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:00] <bblack>	 !log lvs1014 - puppetize and restart pybal for cloudelastic LVS - T224324
[19:34:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:08] <stashbot>	 T224324: LB for cloudelastic - https://phabricator.wikimedia.org/T224324
[19:36:13] <bblack>	 yeah one of the icinga checks is still borked
[19:36:20] <bblack>	 what a nightmare LVS service config is!
[19:37:17] <cdanis>	 bblack: I feel like it shouldn't be so bad :(
[19:37:24] <bblack>	 of course :)
[19:37:59] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: asw2-c-eqiad:xe-2/0/45 inbound interface errors - https://phabricator.wikimedia.org/T229612 (10ayounsi) p:05Triage→03Normal
[19:38:00] <bblack>	 it was poorly-factored way back when, but at least it was relatively tame and easy to understand (IMHO)
[19:38:37] <bblack>	 but in the years since, it's been abused and neglected I think during a bunch of attempts at refactoring it "better" and only getting halfway there, and suffered at the hands of various meta-changes to style standards that don't suit it well, too.
[19:38:46] <bblack>	 it's a complete mess right now
[19:39:00] <wikibugs>	 (03PS1) 10Brennen Bearnes: Reverting php symlink for revert of group1 and group2 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527200
[19:39:02] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Reverting php symlink for revert of group1 and group2 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527200 (owner: 10Brennen Bearnes)
[19:39:15] <twentyafterfour>	 !log finished phabricator database dump 
[19:39:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:28] <bblack>	 I have a lot of history with it, but haven't configured a complex new service in a while, and even I can't make any sense of a right way to do things that works for all known cases :P
[19:39:39] <cdanis>	 heh
[19:40:04] <logmsgbot>	 !log brennen@deploy1001 Synchronized php: Revert group1 and group2 back to 1.34.0-wmf.15 (duration: 00m 53s)
[19:40:07] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1014 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.241:9443, 2620:0:861:1:208:80:154:241:8243, 2620:0:861:1:208:80:154:241:9643, 208.80.154.241:8643, 208.80.154.241:9243, 2620:0:861:1:208:80:154:241:8443, 2620:0:861:1:208:80:154:241:9243, 2620:0:861:1:208:80:154:241:8643, 208.80.154.241:8243, 208.80.154.241:8443, 208.80.154.241:9643, 2620:0:861:1:208:80:154:2
[19:40:07] <icinga-wm>	 https://wikitech.wikimedia.org/wiki/PyBal
[19:40:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:11] <wikibugs>	 (03Merged) 10jenkins-bot: Reverting php symlink for revert of group1 and group2 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527200 (owner: 10Brennen Bearnes)
[19:40:28] <bblack>	 that error will clear itself shortly
[19:40:32] <wikibugs>	 (03CR) 10jenkins-bot: Reverting php symlink for revert of group1 and group2 to 1.34.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527200 (owner: 10Brennen Bearnes)
[19:40:57] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:41:33] <bblack>	 this is the remaining issue now, and I suspect it's deep
[19:41:37] <bblack>	 https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cloudelastic.wikimedia.org
[19:42:07] <bblack>	 note the check should be checking cloudelastic.wikimedia.org, but the IP it's complaining about is actually icinga1001's
[19:42:33] <bblack>	 I think something broke a while back with various auto-configured HTTPS checks for services, and in general many of them are now polling icinga itself instead of the intended target
[19:42:43] <bblack>	 but it's only "obvious" when you poll a port that icinga itself doesn't listen on :P
[19:43:01] <bblack>	 I fixed up something similar for a single case last week I think, but it didn't dawn on me that it could be widespread until now
[19:44:23] <cdanis>	 uhm that's scary if true
[19:45:38] <bblack>	 yup
[19:45:41] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[19:46:40] <wikibugs>	 10Operations, 10Traffic, 10netops, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10ayounsi) This should probably wait on T219908. Whatever solution we find to configure IPv4 based on Netbox data, IPv6 should be the same.
[19:46:51] <bblack>	 hmmm maybe it's not widespread
[19:47:02] <bblack>	 I only see this example when looking at icinga1001's deployed config
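The symptom bblack describes (a check for one host reporting an address that belongs to a different machine) can be spotted mechanically by comparing the address in the alert text with the addresses the service is supposed to have. This is only a sketch: `probed_wrong_target` is a hypothetical helper, and treating 208.80.154.241 as the intended cloudelastic address is an assumption taken from the PyBal IPVS diff alert at 19:40:07, not from any config.

```python
import re

# IPv4 address as reported in "connect to address <ip>" alert text.
ADDR_RE = re.compile(r"connect to address (?P<ip>\d+\.\d+\.\d+\.\d+)")


def probed_wrong_target(alert_text, expected_ips):
    """Return True if the alert names an IP outside `expected_ips`,
    False if the IP is expected, and None if no address is present."""
    match = ADDR_RE.search(alert_text)
    if match is None:
        return None
    return match.group("ip") not in expected_ips


# Alert text from 18:34:57; expected address assumed from the 19:40:07 alert.
alert = ("LVS HTTP IPv4 on cloudelastic.wikimedia.org is CRITICAL: "
         "connect to address 208.80.154.84 and port 8643: Connection refused")
print(probed_wrong_target(alert, {"208.80.154.241"}))
```

As noted at 19:42:43, a mismatch like this only surfaces as a failure when the wrongly probed host does not listen on the port, so a comparison of addresses catches cases that a passing check would hide.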
[19:53:17] <wikibugs>	 (PS1) BBlack: cloudelastic LVS: avoid "lb4" suffix [puppet] - https://gerrit.wikimedia.org/r/527204
[19:53:54] <wikibugs>	 (CR) BBlack: [C: +2] cloudelastic LVS: avoid "lb4" suffix [puppet] - https://gerrit.wikimedia.org/r/527204 (owner: BBlack)
[19:57:19] <wikibugs>	 Operations, ops-codfw, netops: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (ayounsi) Open→Resolved This is done.
[19:57:21] <wikibugs>	 Operations, ops-codfw, netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (ayounsi)
[19:57:31] <bblack>	 !log lvs1016 - restart pybal for slight LVS config change for cloudelastic - T224324
[19:57:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:41] <stashbot>	 T224324: LB for cloudelastic - https://phabricator.wikimedia.org/T224324
[19:57:46] <wikibugs>	 Operations, ops-codfw, netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (ayounsi)
[19:58:57] <icinga-wm>	 PROBLEM - SSH on ms-be2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:00:33] <icinga-wm>	 RECOVERY - SSH on ms-be2018 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:03:05] <wikibugs>	 (PS2) BBlack: Remove cache::canary stuff [puppet] - https://gerrit.wikimedia.org/r/527157
[20:04:01] <wikibugs>	 Operations, netops, observability: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824 (ayounsi)  Service Request ID 2019-0801-0611 has been created.
[20:06:56] <wikibugs>	 (CR) BBlack: [C: +2] Remove cache::canary stuff [puppet] - https://gerrit.wikimedia.org/r/527157 (owner: BBlack)
[20:07:32] <wikibugs>	 (PS1) BBlack: more cp1008 cleanup around puppet [puppet] - https://gerrit.wikimedia.org/r/527209
[20:09:23] <bblack>	 cmon jerkins...
[20:11:19] <wikibugs>	 (PS2) BBlack: more cp1008 cleanup around puppet [puppet] - https://gerrit.wikimedia.org/r/527209
[20:11:37] <wikibugs>	 (CR) BBlack: [V: +2 C: +2] more cp1008 cleanup around puppet [puppet] - https://gerrit.wikimedia.org/r/527209 (owner: BBlack)
[20:11:40] <bblack>	 whatever jerkins
[20:17:56] <greg-g>	 :(
[20:18:06] <greg-g>	 bblack: give us a few more servers, plz
[20:20:57] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active, AS1299/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:21:44] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: update statsd exporter mappings and use exporter [puppet] - https://gerrit.wikimedia.org/r/526782 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[20:21:46] <bblack>	 that's probably not that hard!
[20:21:51] <wikibugs>	 (PS3) Cwhite: logstash: update statsd exporter mappings and use exporter [puppet] - https://gerrit.wikimedia.org/r/526782 (https://phabricator.wikimedia.org/T205870)
[20:26:51] <bd808>	 greg-g: didn't we talk about that before annual planning? I thought we did. :/
[20:29:04] <bblack>	 !log restart pybal on lvs1014
[20:29:09] <greg-g>	 we did, but then we also have a commitment from SRE for a k8s cluster for CI, so we're OK, we just need to get to a point where we can move to it, technologically (aka, ditch zuul2)
[20:29:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:11] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:35:05] <bblack>	 the lvs -> icinga monitoring thing makes no sense :P
[20:38:40] <mutante>	 the part that it pages immediately and you can't add a new service without causing that?
[20:41:19] * Krinkle is going to deployhttps://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/527208/
[20:41:32] <Reedy>	 There is no application set to open the URL deployhttps://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/527208/.
[20:42:41] <Krinkle>	 Try 'Human.app' ? ;-)
[20:44:18] <bblack>	 mutante: no, the part where the monitoring definition comes out completely wrong
[20:44:34] <mutante>	 bblack: ugh :/
[20:44:43] <wikibugs>	 (PS1) BBlack: cloudelastic LVS: try a different icinga check with explicit hostname [puppet] - https://gerrit.wikimedia.org/r/527214
[20:46:19] <mutante>	 !log puppetmaster: create mcrouter certs for scandium.eqiad.wmnet needed to make it an appserver (https://wikitech.wikimedia.org/wiki/Mcrouter#Generate_certs_for_a_new_host) (T228069)
[20:46:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:29] <stashbot>	 T228069: Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069
[20:47:42] <wikibugs>	 (CR) BBlack: [C: +2] cloudelastic LVS: try a different icinga check with explicit hostname [puppet] - https://gerrit.wikimedia.org/r/527214 (owner: BBlack)
[20:47:51] <mutante>	 !log scandium - turning into an mw appserver
[20:47:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:25] <bblack>	 and also the part where if you define 6 services against one service hostname, you only get 1/6 checks defined :P
[20:50:09] <bblack>	 (because the horrible yaml-parsing ERB hack for it stores them in a map by hostname keys)
[20:50:42] <mutante>	 i see :(
[20:51:05] <bblack>	 but hey, at least the 1 check it made actually polls icinga1001 instead of the intended service :P
[20:51:49] <mutante>	 haha, oh man. must have broken during some refactoring i guess
[20:52:03] <bblack>	 yup
[20:52:31] <bblack>	 I think a whole lot of broken refactoring has happened to all LVS-related things (not that it was awesome before all of that, either)
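A minimal sketch of the collision bblack describes at 20:49 to 20:50, assuming the check definitions are collected into a map keyed only by service hostname. The service names, ports, and function names below are hypothetical and are not taken from the actual puppet/ERB code; the sketch only shows why six services sharing one hostname collapse into a single check, and how a key that includes the service name avoids that.

```python
# Hypothetical service definitions: six LVS services sharing one hostname.
# Names and ports are invented for illustration only.
services = [
    {"name": f"svc{i}", "hostname": "cloudelastic.wikimedia.org", "port": 9000 + i}
    for i in range(6)
]


def checks_keyed_by_hostname(services):
    """Mimic the reported behaviour: one map entry per hostname."""
    checks = {}
    for svc in services:
        # Each assignment overwrites the previous entry for the same hostname.
        checks[svc["hostname"]] = svc
    return checks


def checks_keyed_by_name(services):
    """Key on (hostname, service name) so no definition is lost."""
    return {(svc["hostname"], svc["name"]): svc for svc in services}


by_host = checks_keyed_by_hostname(services)
by_name = checks_keyed_by_name(services)
print(len(by_host), len(by_name))  # prints "1 6"
```

With the hostname-only key, only the last definition processed survives, which matches the "1/6 checks defined" symptom.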
[20:53:10] <wikibugs>	 (CR) Dzahn: "hmm.. it installled Notice: /Stage[main]/Packages::Hhvm_dbg/Package[hhvm-dbg]/ensure: created" [puppet] - https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069) (owner: Dzahn)
[20:54:44] <brennen>	 Krinkle: is my reading correct that once that is deployed, train ought to be clear to proceed?
[20:54:50] * Krinkle staging on mwdebug1002
[20:54:58] <Krinkle>	 brennen: yep, certainly worth trying.
[20:55:23] <James_F>	 What could possibly break? ;-)
[20:56:43] <brennen>	 i'm sure we'll find out in due time.
[20:57:20] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.16/includes/Revision/RevisionRenderer.php: T229589 - 3f1b32e4db3698b8 (duration: 00m 50s)
[20:57:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:29] <stashbot>	 T229589: PHP Notice: Undefined property: MediaWiki\Revision\RevisionRenderer::$wikiId - https://phabricator.wikimedia.org/T229589
[20:57:32] <wikibugs>	 Operations, Discovery-Search, Elasticsearch, Traffic: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (BBlack)
[20:57:48] <bblack>	 the ultimate fix for anything: give up and file a ticket and hope someone else fixes it :P
[21:00:12] <mutante>	 ;)
[21:00:46] <wikibugs>	 (PS1) Cwhite: hiera: fix statsd rules [puppet] - https://gerrit.wikimedia.org/r/527221 (https://phabricator.wikimedia.org/T205870)
[21:01:59] <wikibugs>	 (CR) Cwhite: [C: +2] hiera: fix statsd rules [puppet] - https://gerrit.wikimedia.org/r/527221 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[21:02:08] <wikibugs>	 (PS2) Cwhite: hiera: fix statsd rules [puppet] - https://gerrit.wikimedia.org/r/527221 (https://phabricator.wikimedia.org/T205870)
[21:03:48] <wikibugs>	 (Abandoned) Dzahn: varnish wikimedia-backend.vcl: replace cp1008 with cp1099 [puppet] - https://gerrit.wikimedia.org/r/527177 (https://phabricator.wikimedia.org/T229586) (owner: Dzahn)
[21:03:59] <wikibugs>	 (Abandoned) Dzahn: puppetmaster::frontend: remove cp1008 as a canary host [puppet] - https://gerrit.wikimedia.org/r/527180 (https://phabricator.wikimedia.org/T229586) (owner: Dzahn)
[21:04:10] <wikibugs>	 (Abandoned) Dzahn: acme_chief: replace cp1008 with cp1099 as authorized host [puppet] - https://gerrit.wikimedia.org/r/527175 (https://phabricator.wikimedia.org/T229586) (owner: Dzahn)
[21:13:09] <wikibugs>	 (PS1) Brennen Bearnes: Group1 and Group2 to 1.34.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/527223
[21:13:11] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] Group1 and Group2 to 1.34.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/527223 (owner: Brennen Bearnes)
[21:14:47] <wikibugs>	 (Merged) jenkins-bot: Group1 and Group2 to 1.34.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/527223 (owner: Brennen Bearnes)
[21:14:49] <wikibugs>	 (CR) jenkins-bot: Group1 and Group2 to 1.34.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/527223 (owner: Brennen Bearnes)
[21:14:53] <wikibugs>	 (CR) Dzahn: [C: +1] "thanks !:):)" [dns] - https://gerrit.wikimedia.org/r/527114 (https://phabricator.wikimedia.org/T176532) (owner: Marostegui)
[21:16:58] <wikibugs>	 (PS1) Brennen Bearnes: php symlink to 1.34.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/527224
[21:17:00] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] php symlink to 1.34.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/527224 (owner: Brennen Bearnes)
[21:17:40] <Amir1>	 bblack: Regarding "the ultimate fix for anything: give up and file a ticket and hope someone else fixes it :P". Wait until people start assigning tickets to you :D
[21:19:06] <wikibugs>	 (Merged) jenkins-bot: php symlink to 1.34.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/527224 (owner: Brennen Bearnes)
[21:19:21] <wikibugs>	 (CR) jenkins-bot: php symlink to 1.34.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/527224 (owner: Brennen Bearnes)
[21:22:29] <logmsgbot>	 !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 and group2 to 1.34.0-wmf.16
[21:22:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:53] <logmsgbot>	 !log brennen@deploy1001 Synchronized php: group1 and group2 to 1.34.0-wmf.16 (duration: 00m 46s)
[21:24:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:46] <wikibugs>	 (CR) Cwhite: [C: +1] toil: add rsyslog TLS remedy [puppet] - https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) (owner: Filippo Giunchedi)
[21:34:24] <icinga-wm>	 PROBLEM - SSH on ms-be2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:36:38] <wikibugs>	 (PS1) Dzahn: parsoid::testing: fix Hiera key to NOT install hhvm [puppet] - https://gerrit.wikimedia.org/r/527226 (https://phabricator.wikimedia.org/T228069)
[21:37:24] <icinga-wm>	 RECOVERY - SSH on ms-be2021 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:40:21] <icinga-wm>	 PROBLEM - HHVM rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:41:36] <icinga-wm>	 RECOVERY - HHVM rendering on mw1343 is OK: HTTP OK: HTTP/1.1 200 OK - 75698 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:46:05] <wikibugs>	 (CR) Dzahn: [C: +2] parsoid::testing: fix Hiera key to NOT install hhvm [puppet] - https://gerrit.wikimedia.org/r/527226 (https://phabricator.wikimedia.org/T228069) (owner: Dzahn)
[21:46:13] <wikibugs>	 (PS2) Dzahn: parsoid::testing: fix Hiera key to NOT install hhvm [puppet] - https://gerrit.wikimedia.org/r/527226 (https://phabricator.wikimedia.org/T228069)
[21:48:32] <mutante>	 !log scandium - apt-get remove --purge hhvm* (T228069)
[21:48:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:41] <stashbot>	 T228069: Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069
[21:51:27] <mutante>	 i also have an issue with LVS, heh. i use "has_lvs: false" in Hiera for this host (scandium), but on each puppet run i see it changing the content of /etc/default/wikimedia-lvs-realserver
[21:51:40] <mutante>	 and what it changes is ..it removes the LVS_SERVICE_IPS=""
[21:51:47] <mutante>	 then next run it does it again
[21:53:35] <wikibugs>	 (CR) Aaron Schulz: [C: +1] noc: Add cross-dc navigation links to db.php footer [mediawiki-config] - https://gerrit.wikimedia.org/r/527137 (https://phabricator.wikimedia.org/T197126) (owner: Krinkle)
[21:55:28] <wikibugs>	 (PS1) DannyS712: Add importing to english wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607)
[21:58:11] <wikibugs>	 (PS2) DannyS712: Add importing to english wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607)
[21:59:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add importing to english wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) (owner: DannyS712)
[22:00:34] <wikibugs>	 (PS3) DannyS712: Add importing to english wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607)
[22:05:39] <ebernhardson>	 group2 deploy showed a problem with an eventlogging schema, shipping a schema version bump
[22:05:42] <ebernhardson>	 brennen: ^
[22:08:13] <brennen>	 ack.
[22:08:16] <brennen>	 action required at this point?
[22:08:28] <brennen>	 ^ ebernhardson
[22:09:31] <ebernhardson>	 brennen: i'm deploying, just waiting on jenkins
[22:10:13] <brennen>	 cool, ty.
[22:13:21] <mutante>	 !log scandium apt-get remove --purge wikimedia-lvs-realserver (T228069)
[22:13:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:30] <stashbot>	 T228069: Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069
[22:13:40] <mutante>	 !log scandium apt-get autoremove
[22:13:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:01] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) (owner: DannyS712)
[22:17:42] <logmsgbot>	 !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/WikimediaEvents/extension.json: T229614: Update eventlogging schema version to resolve eventlogging errors in wmf.16 (duration: 00m 47s)
[22:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:50] <stashbot>	 T229614: tons of errors on eventlogging events  - https://phabricator.wikimedia.org/T229614
[22:32:54] <wikibugs>	 (PS1) Dzahn: scandium: add has_lvs on node level in hiera [puppet] - https://gerrit.wikimedia.org/r/527232
[22:47:34] <logmsgbot>	 !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@5ebf93e]: Update mobileapps to 2ee48ab
[22:47:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:50:58] <wikibugs>	 (PS1) CDanis: Revert "dbctl: diff PHP vs dbctl configs" [puppet] - https://gerrit.wikimedia.org/r/527245 (https://phabricator.wikimedia.org/T229070)
[22:52:08] <logmsgbot>	 !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@5ebf93e]: Update mobileapps to 2ee48ab (duration: 04m 34s)
[22:52:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:48] <wikibugs>	 (PS2) Dzahn: scandium: add has_lvs on node level in hiera [puppet] - https://gerrit.wikimedia.org/r/527232
[23:00:04] <jouncebot>	 MaxSem, RoanKattouw, and Niharika: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190801T2300).
[23:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:02:51] <wikibugs>	 (PS3) Dzahn: parsoid::testing: temp. comment out php-restarts include [puppet] - https://gerrit.wikimedia.org/r/527232 (https://phabricator.wikimedia.org/T228069)
[23:08:45] <wikibugs>	 (PS4) Dzahn: parsoid::testing: temp. comment out php-restarts include [puppet] - https://gerrit.wikimedia.org/r/527232 (https://phabricator.wikimedia.org/T228069)
[23:08:54] <wikibugs>	 (PS1) Dzahn: scandium: move Hiera key to disable systemd monitor to role level [puppet] - https://gerrit.wikimedia.org/r/527253
[23:09:07] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17713/scandium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/527232 (https://phabricator.wikimedia.org/T228069) (owner: Dzahn)
[23:10:23] <wikibugs>	 (PS5) Dzahn: parsoid::testing: temp. comment out php-restarts include [puppet] - https://gerrit.wikimedia.org/r/527232 (https://phabricator.wikimedia.org/T228069)
[23:10:28] <logmsgbot>	 !log ebernhardson@deploy1001 Synchronized php-1.34.0-wmf.16/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: T229614: Pass proper types to eventlogging to resolve eventlogging errors in wmf.16 (duration: 00m 47s)
[23:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:37] <stashbot>	 T229614: tons of errors on eventlogging events  - https://phabricator.wikimedia.org/T229614
[23:10:40] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[23:16:29] <wikibugs>	 (PS1) DannyS712: Add `autopatroller` group to az wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371)
[23:16:34] <wikibugs>	 (PS1) Bstorm: sssd: Add some new images to test sssd in containers [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/527258 (https://phabricator.wikimedia.org/T229058)
[23:18:06] * Urbanecm is going to deploy a few things
[23:18:39] <wikibugs>	 (CR) Urbanecm: [C: +2] Add importing to english wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) (owner: DannyS712)
[23:19:01] <wikibugs>	 (PS2) DannyS712: Add `autopatrolled` group to az wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371)
[23:19:45] <wikibugs>	 (Merged) jenkins-bot: Add importing to english wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) (owner: DannyS712)
[23:19:55] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, but don't forget to remove the files manually from the affected hosts ;)" [puppet] - https://gerrit.wikimedia.org/r/527245 (https://phabricator.wikimedia.org/T229070) (owner: CDanis)
[23:20:00] <wikibugs>	 (CR) jenkins-bot: Add importing to english wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/527229 (https://phabricator.wikimedia.org/T228607) (owner: DannyS712)
[23:22:58] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: cf01272: Add importing to english wikiquote (T228607) (duration: 00m 48s)
[23:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:06] <stashbot>	 T228607: Add importing to en.wq - https://phabricator.wikimedia.org/T228607
[23:26:10] <wikibugs>	 (PS1) Urbanecm: Remove the "autoreview" user group from ru.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/527264 (https://phabricator.wikimedia.org/T229596)
[23:26:29] <wikibugs>	 (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/527264 (https://phabricator.wikimedia.org/T229596) (owner: Urbanecm)
[23:27:28] <wikibugs>	 (Merged) jenkins-bot: Remove the "autoreview" user group from ru.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/527264 (https://phabricator.wikimedia.org/T229596) (owner: Urbanecm)
[23:27:32] <wikibugs>	 (CR) Urbanecm: [C: +2] Add `autopatrolled` group to az wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371) (owner: DannyS712)
[23:27:43] <wikibugs>	 (CR) jenkins-bot: Remove the "autoreview" user group from ru.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/527264 (https://phabricator.wikimedia.org/T229596) (owner: Urbanecm)
[23:27:53] <wikibugs>	 (PS3) Urbanecm: Add `autopatrolled` group to az wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371) (owner: DannyS712)
[23:28:10] <wikibugs>	 (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371) (owner: DannyS712)
[23:29:16] <wikibugs>	 (Merged) jenkins-bot: Add `autopatrolled` group to az wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371) (owner: DannyS712)
[23:29:24] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: SWAT: 8aca0eb: Remove the "autoreview" user group from ru.wikipedia (T229596) (duration: 00m 47s)
[23:29:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:35] <stashbot>	 T229596: Remove the "autoreview" user group from ru.wikipedia - https://phabricator.wikimedia.org/T229596
[23:29:37] <wikibugs>	 (CR) jenkins-bot: Add `autopatrolled` group to az wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/527257 (https://phabricator.wikimedia.org/T229371) (owner: DannyS712)
[23:29:39] <wikibugs>	 (CR) CDanis: [C: +1] "LGTM but see also https://phabricator.wikimedia.org/T229631" [mediawiki-config] - https://gerrit.wikimedia.org/r/527137 (https://phabricator.wikimedia.org/T197126) (owner: Krinkle)
[23:30:32] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[23:30:52] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[23:31:42] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[23:32:00] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 819073a: Add `autopatrolled` group to az wikisource (T229371) (duration: 00m 49s)
[23:32:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:10] <stashbot>	 T229371: Add autopatroller user group to az.wikisource - https://phabricator.wikimedia.org/T229371
[23:32:38] <Urbanecm>	 !log Evening SWAT done
[23:32:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:10] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 75779 bytes in 0.564 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[23:35:22] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[23:35:38] <mutante>	 that was me.. had to merge
[23:35:42] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[23:35:58] <cdanis>	 mutante: did puppet-merge not automatically perform the labs merge first?
[23:36:32] <mutante>	 cdanis: i just hit submit on gerrit and then got distracted ..is all
[23:36:42] <cdanis>	 ahh okay :)
[23:38:20] <mutante>	 might be nice to have the bot log "user X ran puppet-merge" the way it does when we use conftool.. shrug
[23:40:58] <wikibugs>	 (PS2) Dzahn: scandium: move Hiera key to disable systemd monitor to role level [puppet] - https://gerrit.wikimedia.org/r/527253
[23:41:28] <wikibugs>	 (CR) Dzahn: [C: +2] "noop https://puppet-compiler.wmflabs.org/compiler1002/17715/" [puppet] - https://gerrit.wikimedia.org/r/527253 (owner: Dzahn)
[23:46:06] <wikibugs>	 (CR) CDanis: [C: +2] dbctl: require commit messages [software/conftool] - https://gerrit.wikimedia.org/r/526774 (owner: CDanis)
[23:48:45] <wikibugs>	 (Merged) jenkins-bot: dbctl: require commit messages [software/conftool] - https://gerrit.wikimedia.org/r/526774 (owner: CDanis)