[00:33:41] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:41:03] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 54.39, 21.63, 12.29 https://wikitech.wikimedia.org/wiki/Application_servers
[00:41:49] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 48.57, 20.89, 12.15 https://wikitech.wikimedia.org/wiki/Application_servers
[00:42:41] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 25.45, 22.97, 13.86 https://wikitech.wikimedia.org/wiki/Application_servers
[00:43:01] PROBLEM - High CPU load on API appserver on mw1281 is CRITICAL: CRITICAL - load average: 68.95, 29.76, 17.48 https://wikitech.wikimedia.org/wiki/Application_servers
[00:43:03] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 60.12, 27.87, 17.11 https://wikitech.wikimedia.org/wiki/Application_servers
[00:43:11] PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 69.39, 31.78, 18.54 https://wikitech.wikimedia.org/wiki/Application_servers
[00:43:13] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 67.67, 31.73, 18.29 https://wikitech.wikimedia.org/wiki/Application_servers
[00:43:21] PROBLEM - High CPU load on API appserver on mw1279 is CRITICAL: CRITICAL - load average: 64.47, 31.28, 18.42 https://wikitech.wikimedia.org/wiki/Application_servers
[00:43:27] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 20.95, 21.00, 13.21 https://wikitech.wikimedia.org/wiki/Application_servers
[00:43:27] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 65.81, 32.18, 18.73 https://wikitech.wikimedia.org/wiki/Application_servers
[00:43:31] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 66.68, 32.08, 18.88 https://wikitech.wikimedia.org/wiki/Application_servers
[00:43:35] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:43:37] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 64.28, 27.15, 14.24 https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:19] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[00:44:19] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/
[00:44:19] itoring/recommendation_api
[00:44:19] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[00:44:19] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/
[00:44:20] itoring/recommendation_api
[00:44:21] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:21] PROBLEM - HHVM rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:23] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 62.43, 36.30, 20.89 https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:35] PROBLEM - Nginx local proxy to apache on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:37] PROBLEM - HHVM rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X
[00:44:39] PROBLEM - Nginx local proxy to apache on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:39] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[00:44:39] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre
[00:44:39] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:44:39] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before
[00:44:40] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedi
[00:44:40] es/Monitoring/recommendation_api
[00:44:43] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:44:45] PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:47] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[00:44:48] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:44:49] PROBLEM - Nginx local proxy to apache on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:49] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translatio
[00:44:49] med out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received h
[00:44:49] ikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:44:51] RECOVERY - High CPU load on API appserver on mw1277 is OK: OK - load average: 27.70, 28.73, 18.60 https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:53] PROBLEM - Nginx local proxy to apache on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:53] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X
[00:44:57] PROBLEM - Apache HTTP on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:57] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a resp
[00:44:57] : /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:45:00] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:45:02] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:45:02] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:45:02] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translatio
[00:45:02] med out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received h
[00:45:02] ikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:45:03] PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:45:03] PROBLEM - HHVM rendering on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:45:04] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out bef
[00:45:04] s received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:45:12] PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CRITICAL - load average: 64.33, 40.42, 23.31 https://wikitech.wikimedia.org/wiki/Application_servers
[00:45:14] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:45:18] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:45:18] PROBLEM - HHVM rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:45:18] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:45:18] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:45:20] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good artic
[00:45:20] ut before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:45:21] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:45:21] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:45:23] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X
[00:45:23] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was receiv
[00:45:23] article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:45:23] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:45:23] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:45:23] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:45:24] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:45:24] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:45:25] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[00:45:25] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:45:26] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:45:26] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:45:27] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:45:27] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[00:45:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X
[00:45:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X
[00:46:01] RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 81024 bytes in 0.633 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:01] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 81025 bytes in 1.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:11] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:46:11] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:46:11] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:46:11] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:46:13] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[00:46:13] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:46:13] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 23.80, 31.62, 21.02 https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:13] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:46:13] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X
[00:46:15] RECOVERY - Nginx local proxy to apache on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 1.145 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:17] PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:19] RECOVERY - HHVM rendering on mw1346 is OK: HTTP OK: HTTP/1.1 200 OK - 81025 bytes in 1.543 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:21] RECOVERY - Nginx local proxy to apache on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.913 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:21] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 63.95, 38.81, 22.96 https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:29] RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 19.75, 30.63, 20.88 https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X
[00:46:31] RECOVERY - Nginx local proxy to apache on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.241 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:31] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 21.50, 31.88, 21.58 https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:33] RECOVERY - Nginx local proxy to apache on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.566 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:35] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 81024 bytes in 0.452 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:35] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:46:37] RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 1.149 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:39] RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 19.48, 31.92, 21.94 https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:41] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:43] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 81025 bytes in 1.868 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:43] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:46:45] RECOVERY - HHVM rendering on mw1339 is OK: HTTP OK: HTTP/1.1 200 OK - 81025 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:45] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:46:45] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:46:49] RECOVERY - High CPU load on API appserver on mw1279 is OK: OK - load average: 17.70, 30.05, 21.13 https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:57] RECOVERY - High CPU load on API appserver on mw1278 is OK: OK - load average: 17.59, 30.08, 21.22 https://wikitech.wikimedia.org/wiki/Application_servers
[00:46:57] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:46:57] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:46:59] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:46:59] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:46:59] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:00] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:47:00] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:00] RECOVERY - HHVM rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 81072 bytes in 4.436 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:47:00] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:01] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 14.21, 26.81, 19.92 https://wikitech.wikimedia.org/wiki/Application_servers
[00:47:02] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:02] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:02] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:47:03] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:03] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:04] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:04] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:05] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[00:47:05] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[00:47:06] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X
[00:47:06] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X
[00:47:10] RECOVERY - High CPU load on API appserver on mw1221 is OK: OK - load average: 23.45, 25.07, 16.12 https://wikitech.wikimedia.org/wiki/Application_servers
[00:47:48] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:48] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:50] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:47:50] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:50] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:50] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[00:47:50] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:47:51] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:47:52] RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 660 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:48:06] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 81071 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:48:08] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:48:08] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 658 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:48:16] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:48:28] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:48:34] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:48:40] RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 11.23, 26.82, 21.49 https://wikitech.wikimedia.org/wiki/Application_servers
[00:48:42] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:48:44] PROBLEM - High CPU load on API appserver on mw1313 is CRITICAL: CRITICAL - load average: 81.16, 47.40, 28.44 https://wikitech.wikimedia.org/wiki/Application_servers
[00:49:26] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:49:42] RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 16.24, 31.01, 23.39 https://wikitech.wikimedia.org/wiki/Application_servers
[00:49:44] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 53.00, 35.51, 21.30 https://wikitech.wikimedia.org/wiki/Application_servers
[00:49:46] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:49:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X
[00:50:00] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 49.50, 31.03, 18.90 https://wikitech.wikimedia.org/wiki/Application_servers
[00:50:20] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:50:22] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [00:50:22] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [00:50:32] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 55.03, 37.30, 21.95 https://wikitech.wikimedia.org/wiki/Application_servers [00:50:44] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 49.51, 32.10, 19.61 https://wikitech.wikimedia.org/wiki/Application_servers [00:50:48] PROBLEM - puppet last run on cloudvirt1009 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:50:52] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [00:50:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [00:51:06] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.49, 35.36, 21.98 https://wikitech.wikimedia.org/wiki/Application_servers [00:51:12] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [00:51:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [00:51:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [00:51:58] RECOVERY - High CPU load on API appserver on mw1313 is OK: OK - load average: 19.11, 34.26, 26.93 https://wikitech.wikimedia.org/wiki/Application_servers [00:52:00] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [00:54:36] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 11.37, 24.66, 20.97 https://wikitech.wikimedia.org/wiki/Application_servers [00:54:50] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 11.75, 23.86, 19.65 https://wikitech.wikimedia.org/wiki/Application_servers [00:55:20] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 7.77, 22.80, 20.40 https://wikitech.wikimedia.org/wiki/Application_servers [00:55:34] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 9.68, 21.50, 18.93 https://wikitech.wikimedia.org/wiki/Application_servers [00:57:34] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 8.66, 23.97, 22.62 https://wikitech.wikimedia.org/wiki/Application_servers [01:01:44] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:11:38] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:15:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [01:18:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [01:18:52] RECOVERY - puppet last run on cloudvirt1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:30:34] PROBLEM - puppet last run on mw1340 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:39:10] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:44:00] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:45:56] PROBLEM - puppet last run on cloudvirt1024 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:58:34] RECOVERY - puppet last run on mw1340 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:14:02] RECOVERY - puppet last run on cloudvirt1024 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:38:22] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [02:40:02] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [03:56:48] 3~/win 30 [04:54:42] (PS3) Marostegui: maintain-views: Remove afl_log_id [puppet] - https://gerrit.wikimedia.org/r/525713 (https://phabricator.wikimedia.org/T226851) [04:57:10] (CR) Marostegui: [C: +2] maintain-views: Remove afl_log_id [puppet] - https://gerrit.wikimedia.org/r/525713 (https://phabricator.wikimedia.org/T226851) (owner: Marostegui) [05:05:28] !log Remove db1072 from tendril and zarcillo T228956 [05:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:37] T228956: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 [05:10:58] (PS1) Marostegui: mariadb: Decommission db1072 [puppet] - https://gerrit.wikimedia.org/r/526007 (https://phabricator.wikimedia.org/T228956) [05:13:18] Operations, ops-codfw, DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui) Another crash just happened: ` [Mon Jul 29 04:55:14 2019] mce: [Hardware Error]: Machine 
check events logged [Mon Jul 29 04:55:14 2019] mce: Uncorrected hardware memory error in user-acc... [05:13:36] (CR) Marostegui: [C: +2] mariadb: Decommission db1072 [puppet] - https://gerrit.wikimedia.org/r/526007 (https://phabricator.wikimedia.org/T228956) (owner: Marostegui) [05:14:38] Operations, DBA, decommission, Patch-For-Review: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (Marostegui) [05:15:20] Operations, ops-eqiad, DC-Ops, decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (Marostegui) a:Marostegui→RobH This host is ready for #dc-ops to do the final decommission steps [05:15:52] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui) [05:18:11] !log Drop abuse_filter_log.afl_log_id from s7 codfw with replication (this will cause lag in s7 codfw) - T226851 [05:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:19] T226851: Drop abuse_filter_log.afl_log_id in production - https://phabricator.wikimedia.org/T226851 [05:29:35] (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/526009 (https://phabricator.wikimedia.org/T227062) [05:31:05] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/526009 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui) [05:32:04] (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/526009 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui) [05:32:19] (CR) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/526009 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui) [05:33:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1104 in preparation for Tuesday 30th 
failover in s8 (duration: 00m 54s) [05:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:03] (PS4) Marostegui: mariadb: Promote db1104 to s8 master [puppet] - https://gerrit.wikimedia.org/r/524411 (https://phabricator.wikimedia.org/T227062) [05:34:06] PROBLEM - Check systemd state on elastic2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:34:14] (PS3) Marostegui: db-eqiad.php: Set s8 (wikidata) into read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/525513 (https://phabricator.wikimedia.org/T227062) [05:35:14] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:35:15] (PS1) Marostegui: db-eqiad.php: Promote db1104 to s8 master [mediawiki-config] - https://gerrit.wikimedia.org/r/526011 (https://phabricator.wikimedia.org/T227062) [05:35:28] (CR) Marostegui: [C: -2] "Wait for the failover day" [mediawiki-config] - https://gerrit.wikimedia.org/r/526011 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui) [05:53:45] (PS1) Marostegui: wmnet: Update CNAME for s8 master [dns] - https://gerrit.wikimedia.org/r/526013 (https://phabricator.wikimedia.org/T227062) [05:54:08] RECOVERY - Check systemd state on elastic2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:54:44] (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/526013 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui) [06:17:17] (PS2) Elukey: role::analytics_test_cluster::coordinator: add el refine job [puppet] - https://gerrit.wikimedia.org/r/525824 (https://phabricator.wikimedia.org/T226698) [06:18:28] (CR) Elukey: [C: 
+2] role::analytics_test_cluster::coordinator: add el refine job [puppet] - https://gerrit.wikimedia.org/r/525824 (https://phabricator.wikimedia.org/T226698) (owner: Elukey) [06:24:42] (PS1) Elukey: sre.hadoop.roll-restart-workers.py: increase sleep time for HDFS [cookbooks] - https://gerrit.wikimedia.org/r/526014 (https://phabricator.wikimedia.org/T229003) [06:24:52] PROBLEM - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100% [06:25:20] RECOVERY - Host mw1280 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [06:28:22] (CR) Elukey: [C: +2] sre.hadoop.roll-restart-workers.py: increase sleep time for HDFS [cookbooks] - https://gerrit.wikimedia.org/r/526014 (https://phabricator.wikimedia.org/T229003) (owner: Elukey) [06:30:24] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers [06:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:59] <_joe_> what happened to mw1280?? [06:31:28] <_joe_> !log restarting nrpe on restbase-dev1006 T224260 [06:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:35] T224260: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 [06:31:44] RECOVERY - Disk space on restbase-dev1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1006&var-datasource=eqiad+prometheus/ops [06:31:58] RECOVERY - DPKG on restbase-dev1006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:32:00] RECOVERY - puppet last run on restbase-dev1006 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:06] RECOVERY - Check size of conntrack table on restbase-dev1006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:32:16] PROBLEM - HHVM rendering 
on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [06:32:18] RECOVERY - configured eth on restbase-dev1006 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:32:30] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:36] _joe_ librenms seems not reporting anything weird for the interface, but from https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=mw1280&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver I can see unusual disk activity [06:32:48] RECOVERY - Check whether ferm is active by checking the default input chain on restbase-dev1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:32:54] RECOVERY - MD RAID on restbase-dev1006 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:33:14] RECOVERY - dhclient process on restbase-dev1006 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:33:14] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:33:28] PROBLEM - puppet last run on elastic2054 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:42] <_joe_> elukey: it's one of the hosts in the dreaded rows [06:33:45] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 81104 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Application_servers [06:33:48] <_joe_> so I am expecting trouble [06:36:26] <_joe_> !log restarted coherence report on netmon1002, it failed earlier this morning [06:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:19] <_joe_> !log restarted php7.2 on mwdebug1002, low opcache [06:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:32] <_joe_> I love working as the icinga janitor [06:38:12] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [06:39:09] (PS2) Alexandros Kosiaris: Decommission old jessie-based ORES pool counters [puppet] - https://gerrit.wikimedia.org/r/524162 (https://phabricator.wikimedia.org/T227640) (owner: Muehlenhoff) [06:39:21] (CR) Alexandros Kosiaris: [C: +2] Decommission old jessie-based ORES pool counters [puppet] - https://gerrit.wikimedia.org/r/524162 (https://phabricator.wikimedia.org/T227640) (owner: Muehlenhoff) [06:45:17] !log poweroff orespoolcounter{1,2}00{1,2} for removal T227640 [06:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:24] T227640: Migrate ORES pool counters to Buster - https://phabricator.wikimedia.org/T227640 [06:48:22] RECOVERY - Check the NTP synchronisation status of timesyncd on restbase-dev1006 is OK: OK: synced at Mon 2019-07-29 06:48:20 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [06:48:22] RECOVERY - IPMI Sensor Status on restbase-dev1006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [07:00:06] RECOVERY - puppet last run on elastic2054 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:09:40] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [07:12:48] (PS1) Marostegui: db2123: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/526026 (https://phabricator.wikimedia.org/T228969) [07:13:42] (PS2) Marostegui: db2123: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/526026 (https://phabricator.wikimedia.org/T228969) [07:14:22] (CR) Marostegui: [C: +2] db2123: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/526026 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui) [07:18:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) [07:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:37] \o/ [07:18:49] 54 hosts restarted with spicerack! 
[07:22:24] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:23:53] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 81090 bytes in 1.449 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:26:26] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2038 [mediawiki-config] - https://gerrit.wikimedia.org/r/526092 (https://phabricator.wikimedia.org/T227565) [07:27:20] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2038 [mediawiki-config] - https://gerrit.wikimedia.org/r/526092 (https://phabricator.wikimedia.org/T227565) (owner: Marostegui) [07:28:13] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2038 [mediawiki-config] - https://gerrit.wikimedia.org/r/526092 (https://phabricator.wikimedia.org/T227565) (owner: Marostegui) [07:28:53] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2038 [mediawiki-config] - https://gerrit.wikimedia.org/r/526092 (https://phabricator.wikimedia.org/T227565) (owner: Marostegui) [07:29:32] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2038 from config T221533 (duration: 00m 50s) [07:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:40] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 [07:30:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2038 from config T221533 (duration: 00m 46s) [07:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:26] elukey: yay! 
:D [07:36:14] volans: \o/ [07:36:24] (PS2) Marostegui: db-eqiad.php: Promote db1104 to s8 master [mediawiki-config] - https://gerrit.wikimedia.org/r/526011 (https://phabricator.wikimedia.org/T227062) [07:36:35] (PS4) Marostegui: db-eqiad.php: Set s8 (wikidata) into read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/525513 (https://phabricator.wikimedia.org/T227062) [07:47:01] (PS7) Ema: ATS: Vary-slotting for PHP7 [puppet] - https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) [07:47:52] (Abandoned) Ema: ATS: add-vary Lua plugin [puppet] - https://gerrit.wikimedia.org/r/525815 (https://phabricator.wikimedia.org/T227432) (owner: Ema) [07:49:16] Operations, Puppet, Continuous-Integration-Config: puppet.git rake fails with ruby 2.5 - https://phabricator.wikimedia.org/T208566 (akosiaris) >>! In T208566#5371633, @hashar wrote: > The Gemfile had Puppet 4.8.2 to match the version provided by Debian Jessie: Stretch, not jessie. jessie shipped wit... 
[07:49:47] !log elastic@eqiad force recovery of failed shards (eswiki stuck) [07:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:18] (CR) Ema: "Successfully tested in labs:" [puppet] - https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) (owner: Ema) [08:03:18] (CR) Volans: "Replied to question and other comments inline" (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/525776 (https://phabricator.wikimedia.org/T229003) (owner: Elukey) [08:04:22] (PS1) Marostegui: mariadb: Provision db2128 into s5 [puppet] - https://gerrit.wikimedia.org/r/526101 (https://phabricator.wikimedia.org/T228969) [08:06:05] (CR) Marostegui: [C: +2] mariadb: Provision db2128 into s5 [puppet] - https://gerrit.wikimedia.org/r/526101 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui) [08:08:00] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 85.71% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:08:59] (CR) Volans: Add sre.kafka.roll-restart-brokers.py (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/525804 (https://phabricator.wikimedia.org/T229003) (owner: Elukey) [08:09:06] (CR) Elukey: Add sre.druid.roll-restart-workers.py (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/525776 (https://phabricator.wikimedia.org/T229003) (owner: Elukey) [08:12:45] (PS4) Elukey: Add sre.druid.roll-restart-workers.py [cookbooks] - https://gerrit.wikimedia.org/r/525776 (https://phabricator.wikimedia.org/T229003) [08:16:25] !log Drop abuse_filter_log.afl_log_id in s7 eqiad - T226851 [08:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:33] T226851: Drop abuse_filter_log.afl_log_id in production - 
https://phabricator.wikimedia.org/T226851 [08:19:13] (PS8) Ema: ATS: Vary-slotting for PHP7 [puppet] - https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) [08:19:15] (PS5) Ema: ATS: Vary-slotting for X-Forwarded-Proto [puppet] - https://gerrit.wikimedia.org/r/525310 (https://phabricator.wikimedia.org/T227432) [08:19:45] (CR) Volans: [C: +1] "LGTM, one nit question inline" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/526011 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui) [08:20:11] (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/526013 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui) [08:22:14] (CR) Marostegui: [C: -2] db-eqiad.php: Promote db1104 to s8 master (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/526011 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui) [08:23:13] (CR) Volans: [C: +1] "LGTM" (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/525776 (https://phabricator.wikimedia.org/T229003) (owner: Elukey) [08:23:48] (CR) Elukey: [C: +2] Add sre.druid.roll-restart-workers.py [cookbooks] - https://gerrit.wikimedia.org/r/525776 (https://phabricator.wikimedia.org/T229003) (owner: Elukey) [08:24:06] (CR) Volans: [C: +1] db-eqiad.php: Promote db1104 to s8 master (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/526011 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui) [08:28:35] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:29:18] (PS1) Alexandros Kosiaris: Bump CI puppet Gem version to 5.5.10 [puppet] - https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) 
[08:30:22] (CR) jerkins-bot: [V: -1] Bump CI puppet Gem version to 5.5.10 [puppet] - https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) (owner: Alexandros Kosiaris) [08:31:39] (CR) Fsero: [C: +2] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/525861 (https://phabricator.wikimedia.org/T229051) (owner: Ottomata) [08:31:51] (PS2) Fsero: Allow eventgate-analytics to reach schema.svc.{eqiad,codfw}.wmnet:8190 [puppet] - https://gerrit.wikimedia.org/r/525861 (https://phabricator.wikimedia.org/T229051) (owner: Ottomata) [08:32:42] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers [08:32:48] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hadoop.roll-restart-workers (exit_code=97) [08:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:09] pebcak --^ [08:34:05] (CR) Ema: "> LGTM overall, is this going to reload nginx when deployed ?" 
[puppet] - https://gerrit.wikimedia.org/r/525490 (https://phabricator.wikimedia.org/T228730) (owner: Ema) [08:34:48] (CR) Fsero: [V: +2 C: +2] Allow eventgate-analytics to reach schema.svc.{eqiad,codfw}.wmnet:8190 [deployment-charts] - https://gerrit.wikimedia.org/r/525860 (https://phabricator.wikimedia.org/T229051) (owner: Ottomata) [08:35:47] !log temp stop puppet on cp hosts to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/525259 [08:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:16] (PS2) Filippo Giunchedi: varnish: remove varnishreqstats and varnishstatsd [puppet] - https://gerrit.wikimedia.org/r/525259 (https://phabricator.wikimedia.org/T184942) [08:36:44] (CR) Filippo Giunchedi: [C: +2] varnish: remove varnishreqstats and varnishstatsd [puppet] - https://gerrit.wikimedia.org/r/525259 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi) [08:41:02] (PS1) Elukey: sre.druid.roll-restart-workers.py: unpack commands in run_async [cookbooks] - https://gerrit.wikimedia.org/r/526105 [08:41:30] (CR) Giuseppe Lavagetto: [C: +1] "LGTM, see small nitpick comment (optional)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) (owner: Ema) [08:43:06] (PS2) Elukey: profile::mediawiki::mcrouter_wancache: set async behavior as default [puppet] - https://gerrit.wikimedia.org/r/525224 (https://phabricator.wikimedia.org/T225642) [08:43:45] (CR) Filippo Giunchedi: [C: +1] tlsproxy: toggle dynamic ssl_buffer_size settings [puppet] - https://gerrit.wikimedia.org/r/525490 (https://phabricator.wikimedia.org/T228730) (owner: Ema) [08:44:07] (CR) Elukey: [C: +1] tlsproxy: toggle dynamic ssl_buffer_size settings [puppet] - https://gerrit.wikimedia.org/r/525490 (https://phabricator.wikimedia.org/T228730) (owner: Ema) [08:45:28] (CR) Elukey: [C: +2] 
profile::mediawiki::mcrouter_wancache: set async behavior as default [puppet] - 10https://gerrit.wikimedia.org/r/525224 (https://phabricator.wikimedia.org/T225642) (owner: 10Elukey) [08:46:39] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) [08:47:02] !log set mcrouter async behavior for codfw replication to all mw app/api servers (changes will be picked up when puppet runs on the hosts) - T225642 [08:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:09] T225642: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 [08:49:35] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi All varnish statsd daemons have been retired, "maps performance" dashboard is missing per-backend A... 
[08:49:37] 10Operations, 10observability, 10Goal, 10User-fgiunchedi: Migrate all metrics originated by PoPs from statsd to Prometheus - https://phabricator.wikimedia.org/T220116 (10fgiunchedi) [08:49:38] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/526105 (owner: 10Elukey) [08:49:41] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199 (10fgiunchedi) [08:49:56] (03CR) 10Elukey: [C: 03+2] sre.druid.roll-restart-workers.py: unpack commands in run_async [cookbooks] - 10https://gerrit.wikimedia.org/r/526105 (owner: 10Elukey) [08:50:27] (03PS1) 10Fsero: helmfile,k8s,calico: bug: easier and safer policy embedding [deployment-charts] - 10https://gerrit.wikimedia.org/r/526107 [08:50:42] 10Operations, 10DBA, 10serviceops: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10akosiaris) [08:50:44] (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile,k8s,calico: bug: easier and safer policy embedding [deployment-charts] - 10https://gerrit.wikimedia.org/r/526107 (owner: 10Fsero) [08:50:59] 10Operations, 10DBA, 10serviceops: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10akosiaris) p:05Triage→03Normal [08:51:26] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[08:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:55] (03PS2) 10Hashar: Bump CI puppet Gem version to 5.5.10 [puppet] - 10https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) (owner: 10Alexandros Kosiaris) [08:55:42] !log elukey@cumin1001 START - Cookbook sre.druid.roll-restart-workers [08:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:11] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) (owner: 10Alexandros Kosiaris) [08:58:34] (03PS3) 10Alexandros Kosiaris: Bump CI puppet Gem version to 5.5.10 [puppet] - 10https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) [08:58:36] (03PS1) 10Alexandros Kosiaris: helmfile_sal: Detect sudo usage for logging [puppet] - 10https://gerrit.wikimedia.org/r/526108 [08:59:02] (03CR) 10jerkins-bot: [V: 04-1] Bump CI puppet Gem version to 5.5.10 [puppet] - 10https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) (owner: 10Alexandros Kosiaris) [08:59:17] (03CR) 10jerkins-bot: [V: 04-1] helmfile_sal: Detect sudo usage for logging [puppet] - 10https://gerrit.wikimedia.org/r/526108 (owner: 10Alexandros Kosiaris) [09:00:15] (03CR) 10Elukey: Add sre.kafka.roll-restart-brokers.py (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/525804 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [09:00:41] (03PS1) 10Giuseppe Lavagetto: Tests: discover and run tox in module directories [puppet] - 10https://gerrit.wikimedia.org/r/526109 [09:00:43] (03PS1) 10Giuseppe Lavagetto: envoyproxy: create module, add tls terminator definition [puppet] - 10https://gerrit.wikimedia.org/r/526110 [09:02:17] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: create module, add tls terminator definition [puppet] - 10https://gerrit.wikimedia.org/r/526110 (owner: 10Giuseppe Lavagetto) [09:04:01] 
(03PS9) 10Ema: ATS: Vary-slotting for PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) [09:04:28] (03CR) 10Ema: ATS: Vary-slotting for PHP7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) (owner: 10Ema) [09:05:46] (03CR) 10Ema: [C: 03+2] ATS: Vary-slotting for PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) (owner: 10Ema) [09:06:15] (03PS6) 10Ema: ATS: Vary-slotting for X-Forwarded-Proto [puppet] - 10https://gerrit.wikimedia.org/r/525310 (https://phabricator.wikimedia.org/T227432) [09:06:24] (03PS2) 10Alexandros Kosiaris: helmfile_sal: Detect sudo usage for logging [puppet] - 10https://gerrit.wikimedia.org/r/526108 [09:06:42] (03PS1) 10Elukey: sre.hadoop.roll-restart-workers.py: ensure durability of the shell [cookbooks] - 10https://gerrit.wikimedia.org/r/526111 (https://phabricator.wikimedia.org/T229003) [09:09:00] 10Operations, 10DBA, 10serviceops, 10Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10Marostegui) [09:14:47] (03CR) 10Ema: [C: 03+2] ATS: Vary-slotting for X-Forwarded-Proto [puppet] - 10https://gerrit.wikimedia.org/r/525310 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [09:14:54] (03CR) 10Elukey: [C: 03+2] sre.hadoop.roll-restart-workers.py: ensure durability of the shell [cookbooks] - 10https://gerrit.wikimedia.org/r/526111 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [09:16:16] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review: puppet.git rake fails with ruby 2.5 - https://phabricator.wikimedia.org/T208566 (10hashar) >>! In T208566#5372005, @akosiaris wrote: >>>! In T208566#5371633, @hashar wrote: >> The Gemfile had Puppet 4.8.2 to match the version prov... 
[09:17:32] (03CR) 10Volans: Add sre.kafka.roll-restart-brokers.py (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/525804 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [09:20:09] (03CR) 10Hashar: "It is missing a single quote." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) (owner: 10Alexandros Kosiaris) [09:21:35] (03PS4) 10Ema: ATS: do not cache Authorization responses [puppet] - 10https://gerrit.wikimedia.org/r/525548 (https://phabricator.wikimedia.org/T227432) [09:21:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) [09:21:44] volans: \o/ [09:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:39] yay!!!! [09:22:52] !log elukey@cumin1001 START - Cookbook sre.druid.roll-restart-workers [09:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:02] (this is the other cluster [09:24:05] (03CR) 10Ema: [C: 03+2] ATS: do not cache Authorization responses [puppet] - 10https://gerrit.wikimedia.org/r/525548 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [09:24:54] !log elukey@cumin1001 END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) [09:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:17] ah yes I am stupid [09:25:31] depool/pool are not deployed on druid100[4-6] [09:26:17] (03CR) 10Volans: sre.hadoop.roll-restart-workers.py: ensure durability of the shell (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/526111 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [09:26:32] nit^3 elukey ;) ^^^ [09:27:06] (03PS5) 10Ema: ATS: save and restore CC/Expires when forcing no-cache [puppet] - 10https://gerrit.wikimedia.org/r/525554 (https://phabricator.wikimedia.org/T227432) [09:28:15] (03PS4) 10Alexandros Kosiaris: Bump CI puppet Gem version to 5.5.10 [puppet] - 
10https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) [09:28:17] (03PS3) 10Alexandros Kosiaris: helmfile_sal: Detect sudo usage for logging [puppet] - 10https://gerrit.wikimedia.org/r/526108 [09:28:19] (03PS1) 10Alexandros Kosiaris: Remove informational default-kubernetes-policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/526114 [09:29:18] volans: ack! [09:29:48] no hurry can stay there for a while :D [09:30:11] 10Operations, 10Traffic: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10jijiki) Just a heads up, we are planning to start migrating API servers to serve only via PHP7. For the time being, we have one in each DC. [09:33:28] (03PS1) 10Elukey: Introduce role::statistics::explorer::gpu for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/526115 [09:34:07] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) (owner: 10Alexandros Kosiaris) [09:34:14] (03PS2) 10Giuseppe Lavagetto: Tests: discover and run tox in module directories [puppet] - 10https://gerrit.wikimedia.org/r/526109 [09:34:16] (03PS2) 10Giuseppe Lavagetto: envoyproxy: create module, add tls terminator definition [puppet] - 10https://gerrit.wikimedia.org/r/526110 [09:34:18] (03PS1) 10Giuseppe Lavagetto: tox: exclude mitaka admin_scripts [puppet] - 10https://gerrit.wikimedia.org/r/526116 [09:35:50] (03Abandoned) 10Elukey: Introduce role::statistics::explorer::gpu for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/526115 (owner: 10Elukey) [09:36:10] (03CR) 10Alexandros Kosiaris: "> It is missing a single quote." 
[puppet] - 10https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) (owner: 10Alexandros Kosiaris) [09:36:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] tox: exclude mitaka admin_scripts [puppet] - 10https://gerrit.wikimedia.org/r/526116 (owner: 10Giuseppe Lavagetto) [09:38:09] (03PS1) 10Filippo Giunchedi: prometheus: split puppet failed runs metrics [puppet] - 10https://gerrit.wikimedia.org/r/526118 (https://phabricator.wikimedia.org/T228878) [09:41:28] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: split puppet failed runs metrics [puppet] - 10https://gerrit.wikimedia.org/r/526118 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [09:44:33] (03PS1) 10Elukey: Set role::statistics::explorer to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/526119 [09:44:56] 10Operations, 10MediaWiki-extensions-CentralAuth, 10TimedMediaHandler, 10Traffic, and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Tgr) >>! In T226840#5366460, @TheDJ wrote: > Is this fixed now ? The... [09:45:13] (03PS2) 10Elukey: Set role::statistics::explorer to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/526119 [09:45:41] (03Abandoned) 10Elukey: Set role::statistics::explorer to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/526119 (owner: 10Elukey) [09:45:45] (03Restored) 10Elukey: Introduce role::statistics::explorer::gpu for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/526115 (owner: 10Elukey) [09:46:12] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227539 (10aborrero) >>! In T227539#5360597, @Marostegui wrote: > From the DBA side, it is good to. db1073 is a master for m5 (wikitech, nova...) #cloud-services-team needs to decide if they can afford a downtime th... 
[09:46:27] (03PS6) 10Ema: ATS: save and restore CC/Expires when forcing no-cache [puppet] - 10https://gerrit.wikimedia.org/r/525554 (https://phabricator.wikimedia.org/T227432) [09:48:00] (03CR) 10Ema: [C: 03+2] ATS: save and restore CC/Expires when forcing no-cache [puppet] - 10https://gerrit.wikimedia.org/r/525554 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [09:48:47] (03PS2) 10Elukey: Introduce role::statistics::explorer::gpu for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/526115 [09:49:25] (03CR) 10Fsero: "great job! I have some nits and questions but this is definitely a good start." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526110 (owner: 10Giuseppe Lavagetto) [09:49:28] !log Add db2128 to tendril and zarcillo - T228969 [09:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:37] T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969 [09:56:57] (03CR) 10Fsero: [C: 03+1] "great!" [puppet] - 10https://gerrit.wikimedia.org/r/526108 (owner: 10Alexandros Kosiaris) [09:57:16] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17633/" [puppet] - 10https://gerrit.wikimedia.org/r/526115 (owner: 10Elukey) [09:58:17] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[unit_tests_trafficserver] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:01:48] (03PS1) 10Elukey: role::druid::public::worker: add conftool config/scripts [puppet] - 10https://gerrit.wikimedia.org/r/526122 (https://phabricator.wikimedia.org/T229003) [10:03:22] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17636/druid1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/526122 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [10:03:24] (03CR) 10Elukey: [C: 03+2] role::druid::public::worker: add conftool config/scripts [puppet] - 10https://gerrit.wikimedia.org/r/526122 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [10:03:30] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/17635/" [puppet] - 10https://gerrit.wikimedia.org/r/525535 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [10:05:35] (03CR) 10Giuseppe Lavagetto: envoyproxy: create module, add tls terminator definition (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526110 (owner: 10Giuseppe Lavagetto) [10:08:58] (03PS8) 10Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - 10https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) [10:11:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - 10https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: 10Giuseppe Lavagetto) [10:12:52] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10elukey) 05Open→03Resolved @aaron changed deployed, closing the task but pleas... 
[10:17:17] (03CR) 10Fsero: envoyproxy: create module, add tls terminator definition (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526110 (owner: 10Giuseppe Lavagetto) [10:19:37] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) This failure happened on Sun as well, with wezen and centrallog1001 "stuck" on `recvfrom` from `cloudvirt1015`, e.g. on `centrallog1001` ` 21108 recv... [10:20:39] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190729T1030). [10:31:26] (03PS1) 10Giuseppe Lavagetto: mediawiki::webserver: fix mtail's group [puppet] - 10https://gerrit.wikimedia.org/r/526124 [10:31:34] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526125 (https://phabricator.wikimedia.org/T128546) [10:31:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::webserver: fix mtail's group [puppet] - 10https://gerrit.wikimedia.org/r/526124 (owner: 10Giuseppe Lavagetto) [10:33:09] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526125 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:34:18] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526125 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:36:50] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:526125| Bumping portals to master (T128546)]] (duration: 00m 47s) [10:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:37:00] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:37:38] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:526125| Bumping portals to master (T128546)]] (duration: 00m 47s) [10:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:51] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526125 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190729T1100). [11:00:05] dcausse: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:18] o/ [11:00:28] I can SWAT [11:02:30] go ahead :) [11:03:37] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522092 (https://phabricator.wikimedia.org/T216429) (owner: 10DCausse) [11:04:19] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) Opened a new upstream issue: https://github.com/rsyslog/rsyslog/issues/3770 referencing https://github.com/rsyslog/rsyslog/issues/318 too [11:04:43] (03Merged) 10jenkins-bot: Revert "Revert "[cirrus] Use correct factory declaration for EntityFullTextQueryBuilder"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522092 (https://phabricator.wikimedia.org/T216429) (owner: 10DCausse) [11:07:41] (03CR) 10Filippo Giunchedi: monitoring: tweak description for paging alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/525536 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [11:08:51] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) [11:10:09] !log dcausse@deploy1001 Synchronized wmf-config/SearchSettingsForWikidata.php: [cirrus] Use correct factory declaration for EntityFullTextQueryBuilder (duration: 00m 47s) [11:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:32] !log EU SWAT done [11:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:30] (03CR) 10jenkins-bot: Revert "Revert "[cirrus] Use correct factory declaration for EntityFullTextQueryBuilder"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522092 (https://phabricator.wikimedia.org/T216429) (owner: 10DCausse) [11:13:33] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:13:34] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime 
(exit_code=99) [11:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:58] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:13:59] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:17] !log T228870 reboot cloudvirt1001.eqiad.wmnet for kernel updates [11:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:54] (03CR) 10Jbond: [C: 04-1] "mostly fine however it currently bombs out as module_from_filename is a private method" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526109 (owner: 10Giuseppe Lavagetto) [11:31:09] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:31:10] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:27] !log T228870 reboot cloudvirt1002.eqiad.wmnet for kernel updates [11:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:54] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [11:33:41] oh :-( [11:34:15] ACKNOWLEDGEMENT - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on 
http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.114 second response time Arturo Borrero Gonzalez doing reboots https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [11:35:41] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:36:15] !log icinga downtime toolschecker for 6h [11:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:41] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [11:57:44] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:57:45] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:03] !log T228870 reboot cloudvirt1003.eqiad.wmnet for kernel updates [11:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:38] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [12:20:38] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:56] !log T228870 reboot cloudvirt1004.eqiad.wmnet for kernel updates [12:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:27] (03Abandoned) 10Ladsgroup: varnish: Do not strip the cache out of Special:EntityData if revision is set [puppet] - 10https://gerrit.wikimedia.org/r/525142 (https://phabricator.wikimedia.org/T85499) (owner: 10Ladsgroup) [12:33:28] 
(03PS1) 10Effie Mouzeli: WIP: jobrunners: Make jobrunners PHP7 only by default [puppet] - 10https://gerrit.wikimedia.org/r/526132 (https://phabricator.wikimedia.org/T219148) [12:33:51] 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1017 with 10G interfaces - https://phabricator.wikimedia.org/T228691 (10aborrero) 05Open→03Resolved p:05Triage→03Normal [12:33:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10aborrero) [12:34:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10aborrero) [12:34:46] 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1016 with 10G interfaces - https://phabricator.wikimedia.org/T228692 (10aborrero) 05Open→03Resolved p:05Triage→03Normal [12:34:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10aborrero) [12:37:32] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Pool db2128 into s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526135 (https://phabricator.wikimedia.org/T228969) [12:38:48] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [12:38:48] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:23] !log T228870 reboot cloudvirt1005.eqiad.wmnet for kernel updates [12:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:44] (03CR) 10CDanis: [C: 03+1] db-eqiad,db-codfw.php: Pool db2128 into s5 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/526135 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [12:42:36] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Pool db2128 into s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526135 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [12:43:34] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db2128 into s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526135 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [12:43:46] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove RESTBase graphite alerts. [puppet] - 10https://gerrit.wikimedia.org/r/525856 (https://phabricator.wikimedia.org/T185089) (owner: 10Ppchelko) [12:43:52] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db2128 into s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526135 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui) [12:44:47] (03PS1) 10Giuseppe Lavagetto: prometheus::ops: collect mtail from the application servers [puppet] - 10https://gerrit.wikimedia.org/r/526136 (https://phabricator.wikimedia.org/T226815) [12:44:49] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Provision db2128 into s5 api T221533 (duration: 00m 47s) [12:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:56] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 [12:45:25] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:27] !log Provision db2128 into s5 codfw - T228969 [12:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:34] T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969 [12:45:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Provision db2128 into s5 api T221533 (duration: 00m 47s) [12:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/526136 (https://phabricator.wikimedia.org/T226815) (owner: 10Giuseppe Lavagetto) [12:52:48] (03CR) 10Effie Mouzeli: [C: 03+1] prometheus::ops: collect mtail from the application servers [puppet] - 10https://gerrit.wikimedia.org/r/526136 (https://phabricator.wikimedia.org/T226815) (owner: 10Giuseppe Lavagetto) [12:54:53] (03PS5) 10Marostegui: db-eqiad.php: Set s8 (wikidata) into read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525513 (https://phabricator.wikimedia.org/T227062) [12:55:04] (03PS3) 10Marostegui: db-eqiad.php: Promote db1104 to s8 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526011 (https://phabricator.wikimedia.org/T227062) [13:01:10] !log elukey@cumin1001 START - Cookbook sre.druid.roll-restart-workers [13:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:45] (03PS1) 10Ema: varnish: add text/30-beta-mobile-pass.vtc [puppet] - 10https://gerrit.wikimedia.org/r/526143 (https://phabricator.wikimedia.org/T228861) [13:03:16] (03PS1) 10CDanis: dbctl: add instances/sections to syncer data [puppet] - 10https://gerrit.wikimedia.org/r/526144 [13:06:36] (03CR) 10CDanis: [C: 03+1] monitoring: tweak description for paging alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/525536 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [13:06:44] (03CR) 10Ema: 
[C: 03+2] varnish: add text/30-beta-mobile-pass.vtc [puppet] - 10https://gerrit.wikimedia.org/r/526143 (https://phabricator.wikimedia.org/T228861) (owner: 10Ema) [13:07:50] 10Operations, 10MobileFrontend, 10Traffic, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking): Do not cache the beta version of the mobile site - https://phabricator.wikimedia.org/T228861 (10ema) 05Open→03Resolved [13:08:59] (03CR) 10CDanis: [C: 03+1] Consolidate 'critical' and 'contact groups' logic [puppet] - 10https://gerrit.wikimedia.org/r/525535 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [13:09:03] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [13:09:04] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:18] !log T228870 reboot cloudvirt1006.eqiad.wmnet for kernel updates [13:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:46] (03PS4) 10Ema: tlsproxy: toggle dynamic ssl_buffer_size settings [puppet] - 10https://gerrit.wikimedia.org/r/525490 (https://phabricator.wikimedia.org/T228730) [13:12:09] (03CR) 10Ema: [C: 03+2] tlsproxy: toggle dynamic ssl_buffer_size settings [puppet] - 10https://gerrit.wikimedia.org/r/525490 (https://phabricator.wikimedia.org/T228730) (owner: 10Ema) [13:20:27] (03CR) 10ArielGlenn: "Finally retested this on json dumps and all the rdf varieties in deployment-prep and it all looks good. 
I'd like to roll with this on Aug " [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn) [13:23:09] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [13:23:10] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:28] !log T228870 reboot cloudvirt1007.eqiad.wmnet for kernel updates [13:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:23] (03PS1) 10Ema: tlsproxy: conditionally add ssl_ecdhe_curve to XCP [puppet] - 10https://gerrit.wikimedia.org/r/526147 (https://phabricator.wikimedia.org/T228730) [13:25:12] (03PS2) 10CDanis: dbctl: add instances/sections to syncer data [puppet] - 10https://gerrit.wikimedia.org/r/526144 [13:28:25] !log Stop MySQL on pc2010 - T227552 [13:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:32] T227552: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 [13:30:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) [13:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:33] (03PS3) 10Filippo Giunchedi: (WIP) toil: add rsyslog TLS remedy [puppet] - 10https://gerrit.wikimedia.org/r/520207 (https://phabricator.wikimedia.org/T199406) [13:36:59] (03PS2) 10Ema: tlsproxy: conditionally add ssl_ecdhe_curve to XCP [puppet] - 10https://gerrit.wikimedia.org/r/526147 (https://phabricator.wikimedia.org/T228730) [13:37:02] (03PS2) 10Elukey: Add sre.kafka.roll-restart-brokers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/525804 (https://phabricator.wikimedia.org/T229003) [13:39:34] (03CR) 10Ema: "pcc output looks good to me: https://puppet-compiler.wmflabs.org/compiler1001/17641/" [puppet] - 
10https://gerrit.wikimedia.org/r/526147 (https://phabricator.wikimedia.org/T228730) (owner: 10Ema) [13:41:40] (03CR) 10CDanis: [C: 03+2] "generated from the PHP data, running the dbctl import pipeline locally shows no diffs from 'production'" [puppet] - 10https://gerrit.wikimedia.org/r/526144 (owner: 10CDanis) [13:42:32] (03PS3) 10CDanis: dbctl: add instances/sections to syncer data [puppet] - 10https://gerrit.wikimedia.org/r/526144 [13:46:30] (03PS3) 10Elukey: Add sre.kafka.roll-restart-brokers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/525804 (https://phabricator.wikimedia.org/T229003) [13:47:46] (03CR) 10Ottomata: [C: 03+1] "Comment nits, but +1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526115 (owner: 10Elukey) [13:49:57] (03CR) 10Elukey: Introduce role::statistics::explorer::gpu for stat1005 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526115 (owner: 10Elukey) [13:52:44] (03CR) 10Ottomata: [C: 03+1] "resource_change working great via eventgate in beta, let's go! Will scap when petr is online today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525854 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [13:56:48] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/525804 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [13:57:31] !log cdanis@cumin1001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8816', previous config saved to /var/cache/conftool/dbconfig/20190729-135730-cdanis.json [13:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:45] :O [13:58:34] sweet [13:58:49] cdanis: \o/ [13:59:05] so it is alive! 
[14:00:06] not quite :D [14:00:16] it is alive in its own little universe [14:00:26] tomorrow we start bridging the two universes into one [14:00:52] (PS16) CDanis: dbctl: monitor for uncommitted changes [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) [14:00:59] marostegui will soon lose his record of 100 deployments per day for mediawiki [14:01:05] :D [14:01:36] :D [14:02:10] (CR) CDanis: [C: +2] dbctl: monitor for uncommitted changes [puppet] - https://gerrit.wikimedia.org/r/523013 (https://phabricator.wikimedia.org/T197126) (owner: CDanis) [14:02:28] elukey: can't wait for that to happen [14:02:54] I know :) [14:04:39] (PS3) Elukey: Introduce role::statistics::explorer::gpu for stat1005 [puppet] - https://gerrit.wikimedia.org/r/526115 [14:05:23] (PS8) Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - https://gerrit.wikimedia.org/r/525149 (https://phabricator.wikimedia.org/T228805) [14:05:45] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:06:52] (CR) Andrew Bogott: [C: +2] puppet: add facter.conf and cache some facts [puppet] - https://gerrit.wikimedia.org/r/525149 (https://phabricator.wikimedia.org/T228805) (owner: Andrew Bogott) [14:07:21] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:07:45] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:15] (PS1) CDanis: dbctl: diff PHP vs dbctl configs [puppet] - https://gerrit.wikimedia.org/r/526154 [14:10:20] (CR) jerkins-bot: [V: -1] 
dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 (owner: 10CDanis) [14:12:37] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:21] PROBLEM - puppet last run on pc1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/puppetlabs/facter/facter.conf] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:15:07] (03CR) 10Ottomata: [C: 03+1] ":)" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/525804 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [14:19:37] PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:19:47] (03PS2) 10CDanis: dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 [14:20:54] (03CR) 10jerkins-bot: [V: 04-1] dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 (owner: 10CDanis) [14:24:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] prometheus::ops: collect mtail from the application servers [puppet] - 10https://gerrit.wikimedia.org/r/526136 (https://phabricator.wikimedia.org/T226815) (owner: 10Giuseppe Lavagetto) [14:24:14] (03PS2) 10Giuseppe Lavagetto: prometheus::ops: collect mtail from the application servers [puppet] - 10https://gerrit.wikimedia.org/r/526136 (https://phabricator.wikimedia.org/T226815) [14:25:09] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 51.71, 25.82, 15.54 https://wikitech.wikimedia.org/wiki/Application_servers [14:26:23] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 22.77, 22.38, 15.13 
https://wikitech.wikimedia.org/wiki/Application_servers [14:29:49] RECOVERY - puppet last run on pc1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:35:08] !log shutting down pc2010 for maintenance [14:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:15] PROBLEM - Host pc2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:43:39] (03CR) 10Elukey: Add sre.kafka.roll-restart-brokers.py (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/525804 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [14:46:45] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:49:23] RECOVERY - Host pc2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.95 ms [14:50:21] (03CR) 10Krinkle: Initial canary of dbctl, db config from etcd (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:53:14] (03PS2) 10CDanis: Initial canary of dbctl, db config from etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) [14:53:25] (03CR) 10CDanis: [C: 04-2] Initial canary of dbctl, db config from etcd (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:54:20] (03CR) 10jerkins-bot: [V: 04-1] Initial canary of dbctl, db config from etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:54:36] 10Operations, 10ops-codfw, 10DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Papaul) Swapped DIMM B1 with DIMM A1 to see if we have the same problem on DIMM A1 if we do, we 
will have to replace he main-board. @Marostegui Please let me know it the system crash again . Thanks [14:55:06] (03PS3) 10CDanis: Initial canary of dbctl, db config from etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) [14:56:20] 10Operations, 10ops-codfw, 10DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Marostegui) >>! In T227552#5372881, @Papaul wrote: > Swapped DIMM B1 with DIMM A1 to see if we have the same problem on DIMM A1 if we do, we will have to replace he main-board. > @Marostegui Plea... [15:03:26] (03PS4) 10Elukey: Add sre.kafka.roll-restart-brokers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/525804 (https://phabricator.wikimedia.org/T229003) [15:04:31] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:04:45] (03PS5) 10Elukey: Add sre.kafka.roll-restart-brokers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/525804 (https://phabricator.wikimedia.org/T229003) [15:07:37] (03PS1) 10CRusnov: netbox: Add dummy redis passwords [labs/private] - 10https://gerrit.wikimedia.org/r/526161 [15:07:45] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:08:31] (03PS1) 10Marostegui: install_server: Change db2127's MAC address [puppet] - 10https://gerrit.wikimedia.org/r/526162 (https://phabricator.wikimedia.org/T227113) [15:11:01] (03CR) 10Papaul: [C: 03+2] install_server: Change db2127's MAC address [puppet] - 10https://gerrit.wikimedia.org/r/526162 (https://phabricator.wikimedia.org/T227113) (owner: 10Marostegui) [15:11:21] PROBLEM - puppet last run on analytics1068 is CRITICAL: CRITICAL: Failed to apply 
catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:14:01] (03PS1) 10BBlack: anycast recdns: use for all cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/526163 (https://phabricator.wikimedia.org/T228190) [15:14:03] (03PS1) 10BBlack: anycast recdns: use for all cache_text [puppet] - 10https://gerrit.wikimedia.org/r/526164 (https://phabricator.wikimedia.org/T228190) [15:14:24] 10Operations, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Andrew) [15:15:58] (03PS1) 10Cwhite: monitoring: remove hostname from ssh mgmt definition [puppet] - 10https://gerrit.wikimedia.org/r/526165 [15:16:34] (03CR) 10jerkins-bot: [V: 04-1] monitoring: remove hostname from ssh mgmt definition [puppet] - 10https://gerrit.wikimedia.org/r/526165 (owner: 10Cwhite) [15:17:58] (03CR) 10BBlack: [C: 03+2] anycast recdns: use for all cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/526163 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [15:19:11] (03CR) 10Elukey: [C: 03+2] Add sre.kafka.roll-restart-brokers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/525804 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [15:21:10] (03PS2) 10Cwhite: monitoring: remove hostname from mgmt definitions [puppet] - 10https://gerrit.wikimedia.org/r/526165 [15:21:11] 10Operations, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) 
- https://phabricator.wikimedia.org/T224585 (10aborrero) [15:21:44] (03CR) 10jerkins-bot: [V: 04-1] monitoring: remove hostname from mgmt definitions [puppet] - 10https://gerrit.wikimedia.org/r/526165 (owner: 10Cwhite) [15:22:47] (03PS3) 10Cwhite: monitoring: remove hostname from mgmt definitions [puppet] - 10https://gerrit.wikimedia.org/r/526165 [15:22:59] (03PS2) 10Ottomata: [EventBus] Switch resource_change event to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525854 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [15:27:22] (03CR) 10Ottomata: [C: 03+2] [EventBus] Switch resource_change event to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525854 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [15:27:43] (03CR) 10jenkins-bot: [EventBus] Switch resource_change event to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525854 (https://phabricator.wikimedia.org/T211248) (owner: 10Ppchelko) [15:28:48] 10Operations, 10Analytics, 10SRE-Access-Requests: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (10fdans) @Mayakp.wiki just a comment on hue. It might not be the best tool for querying the data lake. We (as in the analytics team) prefer using either hive/beeline directly or jupyter... 
[15:29:17] Operations, Analytics, SRE-Access-Requests: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (fdans) p:Triage→High [15:29:58] (CR) Cwhite: [C: +2] profile: cleanup per-site varnishkafka deploy flags [puppet] - https://gerrit.wikimedia.org/r/524934 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite) [15:30:11] (PS4) Cwhite: profile: cleanup per-site varnishkafka deploy flags [puppet] - https://gerrit.wikimedia.org/r/524934 (https://phabricator.wikimedia.org/T196066) [15:30:36] (PS1) Andrew Bogott: labs-bootstrapvz: remove debian-backports.list from Buster image [puppet] - https://gerrit.wikimedia.org/r/526168 [15:30:43] (PS1) BBlack: anycast recdns: use for all hosts at edge sites [puppet] - https://gerrit.wikimedia.org/r/526169 (https://phabricator.wikimedia.org/T228190) [15:30:46] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Produce resource_change stream to eventgate-main - T211248 (duration: 00m 47s) [15:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:54] T211248: Modern Event Platform: Stream Intake Service: Migrate eventlogging-service-eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 [15:33:43] RECOVERY - puppet last run on analytics1068 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:34:04] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers [15:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:16] Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul) [15:35:26] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Retry - Produce resource_change stream to eventgate-main - T211248 (duration: 00m 46s) 
[15:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:48] Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Papaul) a:Papaul→Marostegui @Marostegui all yours [15:38:24] Operations, Analytics, Analytics-Kanban, Traffic, Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (colewhite) Open→Resolved [15:38:30] Operations, observability, Goal, User-fgiunchedi: Migrate all metrics originated by PoPs from statsd to Prometheus - https://phabricator.wikimedia.org/T220116 (colewhite) [15:39:22] (Abandoned) Andrew Bogott: labs-bootstrapvz: remove debian-backports.list from Buster image [puppet] - https://gerrit.wikimedia.org/r/526168 (owner: Andrew Bogott) [15:43:29] Operations, SRE-Access-Requests, Patch-For-Review: add jclark to datacenter-ops group - https://phabricator.wikimedia.org/T229124 (herron) Hey @RobH, Cross-validate accounts started sending notifications for: ` Membership of ops group in LDAP and YAML are not identical: ['jclark'] ` I see there are... [15:45:12] Operations, SRE-Access-Requests, Patch-For-Review: add jclark to datacenter-ops group - https://phabricator.wikimedia.org/T229124 (RobH) >>! In T229124#5373133, @herron wrote: > Hey @RobH, Cross-validate accounts started sending notifications for: > > ` > Membership of ops group in LDAP and YAML are... 
[15:47:49] (PS2) BBlack: anycast recdns: use for all cache_text [puppet] - https://gerrit.wikimedia.org/r/526164 (https://phabricator.wikimedia.org/T228190) [15:48:12] (CR) BBlack: [C: +2] anycast recdns: use for all cache_text [puppet] - https://gerrit.wikimedia.org/r/526164 (https://phabricator.wikimedia.org/T228190) (owner: BBlack) [15:51:50] Operations, Analytics, Analytics-Kanban, Traffic, Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (elukey) @colewhite awesome work, thanks! Should I start working on creating a new varnishkafka dashboard... [15:53:55] (PS1) CRusnov: netbox: Add configuration for REDIS [puppet] - https://gerrit.wikimedia.org/r/526173 (https://phabricator.wikimedia.org/T226331) [15:54:08] jouncebot, next [15:54:09] In 1 hour(s) and 5 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190729T1700) [15:54:58] Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Marostegui) db2127 looking good! ` root@db2127:~# free -g ; df -hT /srv total used free shared buff/cache available M... 
[15:55:14] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Marostegui) 05Open→03Resolved [15:55:19] (03CR) 10jerkins-bot: [V: 04-1] netbox: Add configuration for REDIS [puppet] - 10https://gerrit.wikimedia.org/r/526173 (https://phabricator.wikimedia.org/T226331) (owner: 10CRusnov) [15:58:44] (03PS2) 10CRusnov: netbox: Add configuration for REDIS [puppet] - 10https://gerrit.wikimedia.org/r/526173 (https://phabricator.wikimedia.org/T226331) [16:05:57] 10Operations, 10ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (10RobH) p:05Triage→03High [16:06:08] 10Operations, 10ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (10RobH) [16:14:03] 10Operations, 10ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (10RobH) [16:17:37] !log elukey@cumin1001 END (ERROR) - Cookbook sre.kafka.roll-restart-brokers (exit_code=97) [16:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:46] :( [16:18:14] I did it manually with control+c, the interval is too tight (10mins) [16:18:20] but the cookbook works fine :) [16:18:24] ah ok :D [16:19:09] !log manually stopped the sre.kafka.roll-restart-brokers cookbook after 4 brokers restarts since the sleep interval (10mins) is too tight. [16:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:32] elukey: you could move to a metric-based sleep in the long term [16:19:52] like polling some metric every minute and do the next node when that metric is under/over a certain threshold [16:20:37] volans: yes definitely I was thinking the same.. 
I check multiple metrics when restarting, but it would be great indeed to have something that stops the restarts if something is off [16:20:55] (03PS2) 10BBlack: anycast recdns: use for all hosts at edge sites [puppet] - 10https://gerrit.wikimedia.org/r/526169 (https://phabricator.wikimedia.org/T228190) [16:20:57] (03PS1) 10BBlack: anycast recdns: use for all install-time DNS [puppet] - 10https://gerrit.wikimedia.org/r/526177 (https://phabricator.wikimedia.org/T228190) [16:20:58] like "hey something is not really great, can you please triple check before I continue?" [16:20:59] (03PS1) 10BBlack: anycast recdns: Add to calico filters [puppet] - 10https://gerrit.wikimedia.org/r/526178 (https://phabricator.wikimedia.org/T228190) [16:21:29] yeah sure, you can use https://doc.wikimedia.org/spicerack/master/api/spicerack.decorators.html#spicerack.decorators.retry to do the polling for example [16:21:37] once you find what and how to poll it :D [16:22:00] ack! [16:27:56] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Jrbranaa) [16:36:48] (03PS1) 10Ottomata: Use schema aware refine for revision score and resource change [puppet] - 10https://gerrit.wikimedia.org/r/526180 (https://phabricator.wikimedia.org/T211248) [16:37:26] (03CR) 10jerkins-bot: [V: 04-1] Use schema aware refine for revision score and resource change [puppet] - 10https://gerrit.wikimedia.org/r/526180 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [16:37:58] PROBLEM - Host orespoolcounter2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:59] (03PS2) 10Ottomata: Use schema aware refine for revision score and resource change [puppet] - 10https://gerrit.wikimedia.org/r/526180 (https://phabricator.wikimedia.org/T211248) [16:38:12] PROBLEM - Host orespoolcounter1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:38:14] PROBLEM - Host 
orespoolcounter2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:38:14] PROBLEM - Host orespoolcounter1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:39:21] ^ uhm.. decom? [16:41:06] node /orespoolcounter[12]00[34]\.(codfw|eqiad)\.wmnet/ { [16:41:06] role(orespoolcounter) [16:41:06] } [16:41:09] mutante: --^ [16:41:10] https://gerrit.wikimedia.org/r/c/operations/puppet/+/524162 [16:41:17] we should be good [16:41:40] ok, but then these could not alert [16:41:48] ack, they are not in production but that would still not take them down [16:41:54] dunno [16:42:20] (03CR) 10Ayounsi: [C: 03+2] Add an anycast endpoint to syslog centralservers [puppet] - 10https://gerrit.wikimedia.org/r/524037 (owner: 10Ayounsi) [16:42:31] (03PS9) 10Ayounsi: Add an anycast endpoint to syslog centralservers [puppet] - 10https://gerrit.wikimedia.org/r/524037 [16:42:36] XioNoX: neat [16:43:16] cmjohnson1: any chance you powered off or something the orespoolcounter1001/2002/1002? just fishing around for why they went away now [16:44:29] hm maybe a downtime expired [16:44:35] mut ante pointed out: [16:44:44] https://phabricator.wikimedia.org/T227640#5371860 [16:44:55] (03CR) 10Volans: [C: 03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/526161 (owner: 10CRusnov) [16:44:56] akosiaris: yer downtime expired (we guess) [16:45:15] 10Operations, 10ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (10RobH) [16:47:09] !log add anycast syslog to wezen/centrallog1001 [16:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:03] (03CR) 10Volans: "One question inline, LGTM otherwise. 
But please check with @serviceops if it's ok" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526173 (https://phabricator.wikimedia.org/T226331) (owner: 10CRusnov) [16:48:20] (03PS3) 10Giuseppe Lavagetto: Tests: discover and run tox in module directories [puppet] - 10https://gerrit.wikimedia.org/r/526109 [16:48:21] 10Operations, 10ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (10RobH) [16:49:05] (03CR) 10CRusnov: "thanks!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526173 (https://phabricator.wikimedia.org/T226331) (owner: 10CRusnov) [16:49:34] (03CR) 10CRusnov: netbox: Add configuration for REDIS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526173 (https://phabricator.wikimedia.org/T226331) (owner: 10CRusnov) [16:51:58] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install db2131.codfw.wmnet - https://phabricator.wikimedia.org/T229251 (10RobH) p:05Triage→03Normal [16:52:05] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install db2131.codfw.wmnet - https://phabricator.wikimedia.org/T229251 (10RobH) [16:52:10] PROBLEM - puppet last run on wezen is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[anycast-healthchecker] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:52:24] that's known ^ [16:53:05] 10Operations, 10ops-eqiad, 10DBA: (2019-08-31)rack/setup/install db2131.codfw.wmnet - https://phabricator.wikimedia.org/T229251 (10RobH) [16:54:04] 10Operations, 10ops-eqiad, 10DBA: (2019-08-31)rack/setup/install db2131.codfw.wmnet - https://phabricator.wikimedia.org/T229251 (10RobH) [16:54:25] 10Operations, 10ops-eqiad, 10DBA: (2019-08-31)rack/setup/install db2131.codfw.wmnet - https://phabricator.wikimedia.org/T229251 (10RobH) [16:56:18] PROBLEM - puppet last run on centrallog1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[anycast-healthchecker] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:57:29] also known ^ [17:00:04] gehel and onimisionipe: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190729T1700). [17:00:26] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Needs Cleaning - Cassandra Operational), 10Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Dzahn) I see all Icinga alerts are back to OK now. Looks like this ticket is done. Is that right? [17:00:40] urandom: looks like we are done with restbase-dev1006 ? [17:01:54] RECOVERY - puppet last run on centrallog1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:03:22] RECOVERY - puppet last run on wezen is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:03:26] (03CR) 10CDanis: [C: 03+1] Tests: discover and run tox in module directories [puppet] - 10https://gerrit.wikimedia.org/r/526109 (owner: 10Giuseppe Lavagetto) [17:04:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Tests: discover and run tox in module directories (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526109 (owner: 10Giuseppe Lavagetto) [17:05:03] 10Operations, 10Wikimedia-General-or-Unknown, 10Security: Massive spambot registrations at dinwiki - https://phabricator.wikimedia.org/T212519 (10sbassett) 05Open→03Resolved a:03sbassett @Aklapper - I don't think there's anything actionable and nothing in recent rc and new user logs suggests the attack... 
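(Editorial note on the 16:19-16:22 exchange above.) volans suggests replacing the fixed 10-minute sleep in the roll-restart cookbook with a metric-based wait: poll some metric every minute and move on to the next node only once it crosses a threshold. The following is a minimal, standalone Python sketch of that idea, not the actual cookbook and not the spicerack `retry` decorator linked in the chat. Every name in it (`wait_until_healthy`, `rolling_restart`, `read_metric`, the threshold and polling defaults) is hypothetical, and which metric to poll is left open, as it was in the conversation.

```python
import time
from typing import Callable, Iterable, List


def wait_until_healthy(
    read_metric: Callable[[], float],
    threshold: float,
    poll_seconds: float = 60.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll read_metric until its value is at or below threshold.

    Returns the number of polls that were needed. Raises TimeoutError
    if the metric never recovers, so the caller stops restarting and a
    human can check before continuing.
    """
    for attempt in range(1, max_polls + 1):
        if read_metric() <= threshold:
            return attempt
        sleep(poll_seconds)
    raise TimeoutError(f"metric still above {threshold} after {max_polls} polls")


def rolling_restart(
    brokers: Iterable[str],
    restart: Callable[[str], None],
    read_metric: Callable[[], float],
    threshold: float,
    **wait_kwargs,
) -> List[str]:
    """Restart brokers one at a time, waiting on the metric between nodes."""
    restarted: List[str] = []
    for broker in brokers:
        restart(broker)  # hypothetical callable that restarts one broker
        wait_until_healthy(read_metric, threshold, **wait_kwargs)
        restarted.append(broker)
    return restarted
```

Raising instead of continuing matches elukey's wish for something that "stops the restarts if something is off"; injecting `sleep` and `read_metric` keeps the sketch testable without a real cluster.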
[17:05:32] !log reprepro copy buster-wikimedia stretch-wikimedia anycast-healthchecker [17:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:56] !log reprepro copy buster-wikimedia stretch-wikimedia python3-json-logger [17:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:13] godog: ^ the two should be in buster now [17:06:32] XioNoX: nice, I think python3-anycast-healthchecker is needed too [17:08:04] !log reprepro copy buster-wikimedia stretch-wikimedia python3-anycast-healthchecker [17:08:07] godog: added [17:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:50] (PS2) Mforns: analytics::refinery::job::data_purge Migrate mediawiki timers to new script [puppet] - https://gerrit.wikimedia.org/r/519690 (https://phabricator.wikimedia.org/T226862) [17:11:40] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[anycast-healthchecker] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:11:58] (PS6) Urbanecm: Add several rights to eliminators in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/430627 (https://phabricator.wikimedia.org/T176553) (owner: Huji) [17:12:32] (PS3) Mforns: analytics::refinery::job::data_purge Migrate mediawiki timers to new script [puppet] - https://gerrit.wikimedia.org/r/519690 (https://phabricator.wikimedia.org/T226862) [17:13:43] Operations, Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Cassandra Operational), Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (Dzahn) Open→Resolved @eevans Please reopen if something is missing. 
[17:13:46] Operations, ops-eqiad, User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494 (Dzahn) [17:15:43] Operations, ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (RobH) [17:16:09] (PS1) Ayounsi: Anycast syslog, fix check_cmd [puppet] - https://gerrit.wikimedia.org/r/526187 [17:17:26] (CR) Filippo Giunchedi: [C: +1] "LGTM, although run PCC to confirm" [puppet] - https://gerrit.wikimedia.org/r/526187 (owner: Ayounsi) [17:17:43] Operations, ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (RobH) [17:19:03] (CR) Ayounsi: [C: +2] Anycast syslog, fix check_cmd [puppet] - https://gerrit.wikimedia.org/r/526187 (owner: Ayounsi) [17:19:45] PROBLEM - Host syslog.anycast.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [17:20:45] Operations, ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (RobH) [17:21:29] ACKNOWLEDGEMENT - Host orespoolcounter1001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T227640 [17:21:29] ACKNOWLEDGEMENT - Host orespoolcounter1002 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T227640 [17:21:29] ACKNOWLEDGEMENT - Host orespoolcounter2001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T227640 [17:21:29] ACKNOWLEDGEMENT - Host orespoolcounter2002 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T227640 [17:24:16] Operations, Analytics, Analytics-Kanban, Traffic, Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (colewhite) Sounds good to me! Happy to help however I can, just let me know. 
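(Editorial note on the orespoolcounter alerts at 16:37-16:45 and their acknowledgement at 17:21.) The node definition pasted at 16:41, `/orespoolcounter[12]00[34]\.(codfw|eqiad)\.wmnet/`, covers only hosts numbered x003 and x004, which is why the four alerting hosts (1001, 1002, 2001, 2002) fall outside it. The sketch below checks this with Python's `re` module. It assumes that Python and Puppet (Ruby) regular expressions behave the same for this simple pattern, and the fully qualified host names are reconstructed from the usual 1xxx-in-eqiad, 2xxx-in-codfw naming rather than quoted from the log.

```python
import re

# Pattern copied from the node definition pasted in the channel.
NODE_PATTERN = re.compile(r"orespoolcounter[12]00[34]\.(codfw|eqiad)\.wmnet")

# Hosts that alerted as DOWN (FQDNs are an assumption, see the note above).
ALERTING_HOSTS = [
    "orespoolcounter1001.eqiad.wmnet",
    "orespoolcounter1002.eqiad.wmnet",
    "orespoolcounter2001.codfw.wmnet",
    "orespoolcounter2002.codfw.wmnet",
]

# Hosts the pattern is meant to cover (also assumed FQDNs).
COVERED_HOSTS = [
    "orespoolcounter1003.eqiad.wmnet",
    "orespoolcounter1004.eqiad.wmnet",
    "orespoolcounter2003.codfw.wmnet",
    "orespoolcounter2004.codfw.wmnet",
]


def is_covered(fqdn: str) -> bool:
    """Return True if the node pattern matches anywhere in the host name."""
    return NODE_PATTERN.search(fqdn) is not None


if __name__ == "__main__":
    for host in ALERTING_HOSTS + COVERED_HOSTS:
        print(f"{host}: {'matched' if is_covered(host) else 'not matched'}")
```

This only shows which names the pattern selects; as discussed in the channel, falling out of the node definition does not by itself power a host off, and the working guess there was an expired downtime.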
[17:25:01] RECOVERY - Host syslog.anycast.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [17:27:53] chaomodus: i get the Icinga alerts like "Check the Netbox report-s- puppetdb for fail status" and that they are systemd units. but should that also always lead to failed units and therefore "systemd state degraded" each time? [17:28:14] mutante definitely not, one sec [17:28:17] not sure how to avoid it though while keeping the alerts [17:28:21] i think i forgot to send a patch for that [17:28:28] oh, ok ! [17:37:09] PROBLEM - Bird Internet Routing Daemon on lithium is CRITICAL: NRPE: Command check_bird not defined https://wikitech.wikimedia.org/wiki/Anycast_recursive_DNS%23Bird_daemon_not_running [17:40:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:41:07] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:42:41] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:43:43] godog: working on lithium? [17:43:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:44:53] XioNoX: ^ I assume that's your patch, the lithium alert. Anycast DNS isn't even supposed to be there, so clearly there's some mixup between DNS and syslog here. 
[17:45:26] somehow it gets the Icinga check but not the NRPE command definition [17:45:39] mutante, bblack we were working on it [17:45:39] then nagios-nrpe-server can't start due to missing command [17:45:52] ack! [17:45:59] lithium is being decom [17:46:20] the cr2-esams/cr2-eqiad alerts, I assume are more of https://phabricator.wikimedia.org/T228827 [17:46:21] bblack: we have anycast for syslog now :) [17:46:45] XioNoX: yes but that alert is for anycast "DNS", and that's not a DNS server [17:46:57] at least, it mentions DNS on IRC [17:47:22] bblack: ah yeah it's the runbook URL that needs to be updated! [17:48:14] right, ok [17:50:35] (PS3) BBlack: anycast recdns: use for all hosts at edge sites [puppet] - https://gerrit.wikimedia.org/r/526169 (https://phabricator.wikimedia.org/T228190) [17:50:37] (PS2) BBlack: anycast recdns: use for all install-time DNS [puppet] - https://gerrit.wikimedia.org/r/526177 (https://phabricator.wikimedia.org/T228190) [17:50:39] (PS2) BBlack: anycast recdns: Add to calico filters [puppet] - https://gerrit.wikimedia.org/r/526178 (https://phabricator.wikimedia.org/T228190) [17:53:03] (PS1) Ayounsi: Anycast, fix runbook url [puppet] - https://gerrit.wikimedia.org/r/526194 [17:55:46] ACKNOWLEDGEMENT - Bird Internet Routing Daemon on lithium is CRITICAL: NRPE: Command check_bird not defined Ayounsi known, decoming https://wikitech.wikimedia.org/wiki/Anycast_recursive_DNS%23Bird_daemon_not_running [17:57:47] (CR) Ayounsi: [C: +2] Anycast, fix runbook url [puppet] - https://gerrit.wikimedia.org/r/526194 (owner: Ayounsi) [17:58:56] url fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/526194 [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190729T1800). 
[18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:12] * Urbanecm is deploying his patches [18:00:31] Urbanecm: Don't forget to add your patches to the calendar. [18:00:35] will do [18:01:05] (03CR) 10Urbanecm: [C: 03+2] Add several rights to eliminators in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430627 (https://phabricator.wikimedia.org/T176553) (owner: 10Huji) [18:01:42] 10Operations, 10ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (10RobH) [18:04:14] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.15/extensions/AbuseFilter: SWAT: [[:gerrit:525598|Initialize user-defined variables during shortcircuit]] (T214674) (duration: 00m 49s) [18:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:22] T214674: Short circuit fails with assignments - https://phabricator.wikimedia.org/T214674 [18:07:26] (03Merged) 10jenkins-bot: Add several rights to eliminators in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430627 (https://phabricator.wikimedia.org/T176553) (owner: 10Huji) [18:07:43] (03CR) 10jenkins-bot: Add several rights to eliminators in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430627 (https://phabricator.wikimedia.org/T176553) (owner: 10Huji) [18:08:49] !log urbanecm@deploy1001 Synchronized wmf-config/abusefilter.php: SWAT: [[:gerrit:430627|Add several rights to eliminators in fawiki]] (T176553, 1/2) (duration: 00m 47s) [18:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:57] T176553: Add rollback, autopatrol, extended confirmed, patroller and uploader to eliminators in fawiki - https://phabricator.wikimedia.org/T176553 [18:10:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:430627|Add several rights to eliminators in fawiki]] (T176553, 2/2) (duration: 00m 47s) [18:10:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:56] (03PS1) 10Urbanecm: Rename Image-reviewer to image-reviewer on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526196 (https://phabricator.wikimedia.org/T216406) [18:14:47] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526196 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [18:15:16] (03PS1) 10Petar.petkovic: Decrease idwiki MT threshold for publishing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526197 (https://phabricator.wikimedia.org/T228971) [18:17:59] !log switch traffic to the GTT link between Ashburn and Amsterdam (set GTT metric to 820 vs. 1820 before) - T228827 [18:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:05] T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 [18:18:26] 10Operations: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10elukey) Following the Debian Bug, it seems that in https://sources.debian.org/src/tmpreaper/1.6.13+nmu1/tmpreaper.c/?hl=452#L422 we could add a simple check to avoid this. From ` if (lstat (ent->d_name, &sb)) {... 
[18:18:31] (03Merged) 10jenkins-bot: Rename Image-reviewer to image-reviewer on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526196 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [18:18:46] (03CR) 10jenkins-bot: Rename Image-reviewer to image-reviewer on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526196 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [18:18:48] (03Abandoned) 10Andrew Bogott: openldap: spruce up the anti-memory-leak cron for replicas [puppet] - 10https://gerrit.wikimedia.org/r/498902 (owner: 10Andrew Bogott) [18:19:50] !log Run mwscript migrateUserGroup.php --wiki=fawiki Image-reviewer image-reviewer (T216406) [18:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:57] T216406: Rename `Image-reviewer` to `image-reviewer`, then migrate all its members - https://phabricator.wikimedia.org/T216406 [18:20:19] (03CR) 10Andrew Bogott: "for the record: This doesn't actually affect VM behavior; they are controlled by 'labs.yaml' which already reflects this change. This ch" [puppet] - 10https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722) (owner: 10Muehlenhoff) [18:20:23] (03CR) 10Andrew Bogott: [C: 03+1] Switch back Cloud VPS instances to the read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722) (owner: 10Muehlenhoff) [18:21:55] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:526196|Rename Image-reviewer to image-reviewer on fawiki]] (T216406) (duration: 00m 47s) [18:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:08] 10Operations, 10netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10ayounsi) We can see that link flapping in https://librenms.wikimedia.org/device/device=2/tab=port/port=6835/view=events/ as well. 
I think only one of those was a planned mainten... [18:22:20] 10Operations, 10netops: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10ayounsi) a:03ayounsi [18:22:37] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Dzahn) server is still alerting https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=thumbor1004&service=Memory+correctable+errors+-EDAC- needs... [18:23:36] !log Morning SWAT done [18:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:36] (03CR) 10Ayounsi: [C: 03+2] Anycast: Add Prometheus exporter to Bird [puppet] - 10https://gerrit.wikimedia.org/r/525659 (owner: 10Ayounsi) [18:26:48] (03PS4) 10Ayounsi: Anycast: Add Prometheus exporter to Bird [puppet] - 10https://gerrit.wikimedia.org/r/525659 [18:28:23] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:29:04] 10Operations, 10ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (10RobH) [18:30:40] 10Operations, 10ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (10RobH) [18:31:36] (03PS1) 10Ayounsi: Revert "Anycast: Add Prometheus exporter to Bird" [puppet] - 10https://gerrit.wikimedia.org/r/526199 [18:32:15] that's me ^ I'm rolling back the change [18:32:56] (03CR) 10Ayounsi: [C: 03+2] Revert "Anycast: Add Prometheus exporter to Bird" [puppet] - 10https://gerrit.wikimedia.org/r/526199 (owner: 10Ayounsi) [18:33:51] PROBLEM - puppet last run on dns1002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:36:21] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:36:25] for some reason that error didn't get caught by the puppet compiler [18:45:12] (03PS1) 10Dzahn: add certbot renewal config for Letsencrypt [wikitech-static] - 10https://gerrit.wikimedia.org/r/526200 (https://phabricator.wikimedia.org/T214640) [18:53:41] (03PS1) 10Ayounsi: Anycast: Add Prometheus exporter to Bird (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/526203 [18:57:33] (03PS2) 10Dzahn: add certbot renewal config for Letsencrypt [wikitech-static] - 10https://gerrit.wikimedia.org/r/526200 (https://phabricator.wikimedia.org/T214640) [18:57:52] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add certbot renewal config for Letsencrypt [wikitech-static] - 10https://gerrit.wikimedia.org/r/526200 (https://phabricator.wikimedia.org/T214640) (owner: 10Dzahn) [18:58:14] 10Operations, 10wikitech.wikimedia.org, 10Patch-For-Review: wikitech-static cert renewal seems to stop apache2 - https://phabricator.wikimedia.org/T214640 (10Dzahn) 05Open→03Resolved [19:00:28] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:00:30] RECOVERY - puppet last run on dns1002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:10:54] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@c3ffbee]: Weekly deploy [19:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:28] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [140.0] 
https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:15:14] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [19:19:02] (03CR) 10Krinkle: "The related core change has landed and will roll out this week - https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/525961/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525977 (owner: 10Aaron Schulz) [19:19:34] (03PS3) 10CDanis: dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 [19:20:44] (03CR) 10jerkins-bot: [V: 04-1] dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 (owner: 10CDanis) [19:22:37] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@c3ffbee]: Weekly deploy (duration: 11m 42s) [19:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:02] (03PS4) 10CDanis: dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 [19:25:10] (03CR) 10jerkins-bot: [V: 04-1] dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 (owner: 10CDanis) [19:26:50] (03PS5) 10CDanis: dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 [19:27:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10RobH) I don't see any errors in the Service Event Log: ` /admin1-> racadm getsel Record: 1 Date/Time: 07/25/2019... 
[19:27:26] (03CR) 10jerkins-bot: [V: 04-1] dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 (owner: 10CDanis) [19:29:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10RobH) ePSA Pre-boot System Assessment is now running, will update task with results [19:35:16] (03PS6) 10CDanis: dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 [19:35:53] (03CR) 10jerkins-bot: [V: 04-1] dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 (owner: 10CDanis) [19:38:46] (03PS7) 10CDanis: dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 [19:39:20] (03CR) 10jerkins-bot: [V: 04-1] dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 (owner: 10CDanis) [19:40:37] (03PS8) 10CDanis: dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 [19:41:10] (03CR) 10jerkins-bot: [V: 04-1] dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 (owner: 10CDanis) [19:41:42] sigh, jerkins [19:58:13] 10Operations, 10ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (10RobH) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: Your horoscope predicts another unfortunate Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190729T2000). [20:00:17] no parsoid deploy today ... 
[20:00:20] lol jerkins [20:04:14] 10Operations, 10ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (10RobH) [20:04:29] (03PS9) 10CDanis: dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 [20:05:35] (03CR) 10jerkins-bot: [V: 04-1] dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 (owner: 10CDanis) [20:08:00] (03PS10) 10CDanis: dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 [20:09:18] (03CR) 10jerkins-bot: [V: 04-1] dbctl: diff PHP vs dbctl configs [puppet] - 10https://gerrit.wikimedia.org/r/526154 (owner: 10CDanis) [20:15:16] (03PS1) 10CDanis: pep8: ignore failing openstack files [puppet] - 10https://gerrit.wikimedia.org/r/526243 [20:21:34] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:40:20] (03PS2) 10CDanis: pep8: ignore failing openstack files [puppet] - 10https://gerrit.wikimedia.org/r/526243 (https://phabricator.wikimedia.org/T229274) [20:41:10] (03PS3) 10CDanis: pep8: ignore failing openstack files [puppet] - 10https://gerrit.wikimedia.org/r/526243 (https://phabricator.wikimedia.org/T229274) [20:51:58] (03CR) 10Andrew Bogott: [C: 04-1] "let's just fix the errors. I only see half a dozen or so." [puppet] - 10https://gerrit.wikimedia.org/r/526243 (https://phabricator.wikimedia.org/T229274) (owner: 10CDanis) [21:00:04] Reedy and sbassett: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190729T2100). 
[21:00:36] (03PS1) 10Bstorm: maintain-kubeusers: set timeout on reading LDAP data [puppet] - 10https://gerrit.wikimedia.org/r/526250 (https://phabricator.wikimedia.org/T194859) [21:01:56] (03PS2) 10Bstorm: maintain-kubeusers: set timeout on reading LDAP data [puppet] - 10https://gerrit.wikimedia.org/r/526250 (https://phabricator.wikimedia.org/T194859) [21:04:29] (03CR) 10BryanDavis: [C: 03+1] "Worth a shot!" [puppet] - 10https://gerrit.wikimedia.org/r/526250 (https://phabricator.wikimedia.org/T194859) (owner: 10Bstorm) [21:05:14] (03CR) 10Bstorm: [C: 03+2] maintain-kubeusers: set timeout on reading LDAP data [puppet] - 10https://gerrit.wikimedia.org/r/526250 (https://phabricator.wikimedia.org/T194859) (owner: 10Bstorm) [21:08:20] (03CR) 10Dzahn: [C: 03+1] Consolidate 'critical' and 'contact groups' logic [puppet] - 10https://gerrit.wikimedia.org/r/525535 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [21:08:54] 10Operations, 10Analytics, 10EventBus, 10MassMessage, and 2 others: Write incident report for jobs not being executed on 1.34.0-wmf.10 - https://phabricator.wikimedia.org/T226109 (10Pchelolo) I've made a preliminary incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190619-JobQ... 
[21:10:26] (03PS1) 10Bstorm: Revert "maintain-kubeusers: set timeout on reading LDAP data" [puppet] - 10https://gerrit.wikimedia.org/r/526251 [21:13:06] (03CR) 10Bstorm: [C: 03+2] Revert "maintain-kubeusers: set timeout on reading LDAP data" [puppet] - 10https://gerrit.wikimedia.org/r/526251 (owner: 10Bstorm) [21:26:40] (03PS1) 10Jforrester: mediawiki::php: Don't install gd any more, ZeroBanner is gone [puppet] - 10https://gerrit.wikimedia.org/r/526255 (https://phabricator.wikimedia.org/T227734) [21:26:49] (03PS1) 10BryanDavis: openstack: Fix flake8 warnings in admin_scripts [puppet] - 10https://gerrit.wikimedia.org/r/526256 (https://phabricator.wikimedia.org/T229274) [21:27:32] (03CR) 10Jforrester: "Almost but not absolutely sure that this isn't used." [puppet] - 10https://gerrit.wikimedia.org/r/526255 (https://phabricator.wikimedia.org/T227734) (owner: 10Jforrester) [21:29:29] (03PS1) 10Bstorm: maintain-kubeusers: allow ldap3 to raise exceptions [puppet] - 10https://gerrit.wikimedia.org/r/526258 (https://phabricator.wikimedia.org/T194859) [21:30:24] (03PS2) 10Jforrester: apache: Stop aliasing zero.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) [21:30:38] (03PS3) 10Jforrester: apache: Stop aliasing zero.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) [21:31:00] (03PS2) 10BryanDavis: openstack: Fix flake8 warnings in admin_scripts [puppet] - 10https://gerrit.wikimedia.org/r/526256 (https://phabricator.wikimedia.org/T229274) [21:31:02] (03PS2) 10Bstorm: maintain-kubeusers: allow ldap3 to raise exceptions [puppet] - 10https://gerrit.wikimedia.org/r/526258 (https://phabricator.wikimedia.org/T194859) [21:31:06] (03CR) 10CDanis: [C: 03+1] openstack: Fix flake8 warnings in admin_scripts [puppet] - 10https://gerrit.wikimedia.org/r/526256 (https://phabricator.wikimedia.org/T229274) (owner: 10BryanDavis) [21:31:38] PROBLEM - Check the Netbox report-s- 
puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:33:19] (03PS3) 10Bstorm: maintain-kubeusers: allow ldap3 to raise exceptions [puppet] - 10https://gerrit.wikimedia.org/r/526258 (https://phabricator.wikimedia.org/T194859) [21:38:56] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/17642/" [puppet] - 10https://gerrit.wikimedia.org/r/526203 (owner: 10Ayounsi) [21:39:16] (03PS3) 10BryanDavis: openstack: Fix flake8 warnings in admin_scripts [puppet] - 10https://gerrit.wikimedia.org/r/526256 (https://phabricator.wikimedia.org/T229274) [21:40:20] (03CR) 10jerkins-bot: [V: 04-1] openstack: Fix flake8 warnings in admin_scripts [puppet] - 10https://gerrit.wikimedia.org/r/526256 (https://phabricator.wikimedia.org/T229274) (owner: 10BryanDavis) [21:41:40] (03PS4) 10BryanDavis: openstack: Fix flake8 warnings in admin_scripts [puppet] - 10https://gerrit.wikimedia.org/r/526256 (https://phabricator.wikimedia.org/T229274) [21:41:46] (03CR) 10BryanDavis: openstack: Fix flake8 warnings in admin_scripts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526256 (https://phabricator.wikimedia.org/T229274) (owner: 10BryanDavis) [21:43:26] !log replace ulsfo network devices' syslog target with syslog.anycast.wmnet [21:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:08] (03CR) 10BryanDavis: "See also:" [puppet] - 10https://gerrit.wikimedia.org/r/526116 (owner: 10Giuseppe Lavagetto) [21:44:55] (03CR) 10BryanDavis: openstack: Fix flake8 warnings in admin_scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526256 (https://phabricator.wikimedia.org/T229274) (owner: 10BryanDavis) [21:45:24] (03Abandoned) 10CDanis: pep8: ignore failing openstack files [puppet] - 10https://gerrit.wikimedia.org/r/526243 (https://phabricator.wikimedia.org/T229274) (owner: 10CDanis) [21:47:58] (03CR) 10BryanDavis: openstack: Fix flake8 warnings in 
admin_scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526256 (https://phabricator.wikimedia.org/T229274) (owner: 10BryanDavis) [22:00:14] !log krinkle@deploy1001: Dirty git status on extensions/AbuseFilter and extensions/CheckUser in php-1.34.0-wmf.15 [22:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:23] Urbanecm: ^ I don't think your deployment worked the way you intended. [22:00:39] Patch has not been applied. [22:00:43] ahh... [22:00:56] If you're around in 10min I can roll it out :) [22:00:58] After this.. [22:01:09] no problem, can wait, watching a movie :) [22:01:35] I recommend using a git-status text representation in your PS1 prompt. E.g. will show something like "(master)" by default and "(master +%)" when stuff is dirty. [22:01:47] (03CR) 10Volans: "Looks good in general, some comment inline" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526154 (owner: 10CDanis) [22:01:48] i use it [22:02:02] but when you're not in php directory, you see /srv/mediawiki-staging [22:02:07] Good point. [22:02:09] and that's where i start scap sync-file [22:02:17] because then i can use tab completion [22:02:21] Yeah, and in the php directory, CheckUser was already dirty. [22:02:27] true [22:02:32] you were in the php directory presumably to pull the git repo. [22:02:48] but yeah, once one repo is dirty, it quickly spreads. [22:02:58] PROBLEM - Device not healthy -SMART- on ms-be2021 is CRITICAL: cluster=swift device=cciss,13 instance=ms-be2021:9100 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2021&var-datasource=codfw+prometheus/ops [22:03:16] yeah, but probably oversaw it [22:03:26] anyway, do you plan to roll it out, or should i fix it? 
[22:05:00] !log replace ulsfo network devices' DNS target with 10.3.0.1 [22:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:20] PROBLEM - HP RAID on ms-be2021 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:4 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:05:24] ACKNOWLEDGEMENT - HP RAID on ms-be2021 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:4 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T229283 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:05:27] 10Operations, 10ops-codfw: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T229283 (10ops-monitoring-bot) [22:06:18] 10Operations, 10ops-eqiad, 10DC-Ops: add all remaining new pdus to netbox - https://phabricator.wikimedia.org/T229284 (10RobH) [22:07:44] Krinkle, ^^ [22:08:48] PROBLEM - puppet last run on ms-be2021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdf] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:09:43] Urbanecm: I'll roll it out if you're around to verify anything that can be verified. [22:09:48] In ~ 5min or so. [22:09:54] waiting or CI on my patch. 
[22:09:59] but can do yours first if you have time now [22:10:34] there's not much to be verified, "no new errors and less errors we know about" is the requirement [22:11:04] Urbanecm: ok, on mwdebug1002 now [22:11:10] thanks [22:12:16] 10Operations, 10ops-eqiad, 10DC-Ops: add all remaining new pdus to netbox - https://phabricator.wikimedia.org/T229284 (10RobH) [22:13:50] 10Operations, 10ops-eqiad, 10DC-Ops: add all remaining new pdus to netbox - https://phabricator.wikimedia.org/T229284 (10RobH) [22:15:06] Krinkle, looks good to me [22:15:33] 10Operations, 10ops-eqiad, 10DC-Ops: add all remaining new pdus to netbox - https://phabricator.wikimedia.org/T229284 (10RobH) a:05RobH→03Jclark-ctr [22:16:08] Urbanecm: okay, rolling out [22:16:50] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.15/extensions/AbuseFilter/: T214674 - 940955ea3844721a0 (duration: 00m 48s) [22:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:58] T214674: Short circuit fails with assignments - https://phabricator.wikimedia.org/T214674 [22:17:16] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [22:17:42] thanks Krinkle [22:18:24] 10Operations, 10ops-eqiad, 10DC-Ops: add all remaining new pdus to netbox - https://phabricator.wikimedia.org/T229284 (10RobH) [22:22:55] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T229156 (10wiki_willy) a:03Cmjohnson System is in-warranty (doesn't expire until May 2020) [22:23:00] (03CR) 10BryanDavis: [C: 03+1] maintain-kubeusers: allow ldap3 to raise exceptions [puppet] - 10https://gerrit.wikimedia.org/r/526258 (https://phabricator.wikimedia.org/T194859) (owner: 10Bstorm) [22:24:02] 10Operations, 10ops-eqiad: Degraded RAID on sulfur - https://phabricator.wikimedia.org/T229134 (10wiki_willy) a:03Cmjohnson [22:25:07] 10Operations, 10ops-codfw: Degraded RAID on 
ms-be2021 - https://phabricator.wikimedia.org/T229283 (10wiki_willy) a:03Papaul [22:26:58] Urbanecm: AbuseFilter is logging 100,000 times more than usual. [22:27:01] Krinkle, oh oh, doesn't seem good on traffic... [22:27:06] just noticed that as well [22:27:08] let's roll it back [22:27:11] should i, or will you? [22:27:29] mwdebug* isn't that good when you test by looking on logs [22:28:40] !log roll out anycast DNS and syslog to all network devices - T228190 [22:28:44] yeah, reverting now [22:28:48] thanks [22:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:49] T228190: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190 [22:29:12] Urbanecm: staged on mwdebug1002 [22:29:18] together with my core patch [22:29:33] Krinkle, you mean, the revert? what's the purpose of stagging a revert? [22:31:04] Urbanecm: just in case :) [22:31:12] shouldn't make it worse than it was [22:31:31] Yeah, but I don't trust MW in prod. Sometimes forwards is not the same as backwards [22:31:38] Just making sure an edit still works without more warnings [22:31:45] let me try to save few things [22:31:53] rolling out now [22:32:18] ack [22:32:37] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.15/extensions/AbuseFilter/: T214674 - bfcaf0c26d6 (duration: 00m 48s) [22:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:46] T214674: Short circuit fails with assignments - https://phabricator.wikimedia.org/T214674 [22:32:57] thanks once again :) [22:33:05] yw [22:33:44] i've set the task's priority to ubn, since it's a train blocker now [22:33:48] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.15/includes/cache/MessageCache.php: T208897 - fa817b088e43975 (duration: 00m 47s) [22:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:00] T208897: PHP Fatal Error: Argument passed to MessageCache::isMainCacheable() must be array - 
https://phabricator.wikimedia.org/T208897 [22:34:56] Urbanecm: thanks [22:35:00] yw [22:35:10] three blockers before Tuesday, not a great start :/ [22:36:16] Indeed. [22:36:17] (03CR) 10Bstorm: [C: 03+2] maintain-kubeusers: allow ldap3 to raise exceptions [puppet] - 10https://gerrit.wikimedia.org/r/526258 (https://phabricator.wikimedia.org/T194859) (owner: 10Bstorm) [22:36:42] (03CR) 10CRusnov: [V: 03+2 C: 03+2] netbox: Add dummy redis passwords [labs/private] - 10https://gerrit.wikimedia.org/r/526161 (owner: 10CRusnov) [22:39:43] 10Operations, 10Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (10Jdforrester-WMF) Pinging @herron who is apparently on SRE clinic duty this week. [23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190729T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:00:13] Oh I have one that I forgot to register [23:00:17] I'll just deploy it myself [23:09:36] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10serviceops-radar, and 7 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10WDoranWMF) [23:09:48] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10WDoranWMF) [23:10:39] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10serviceops-radar, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10WDoranWMF) [23:12:41] (03CR) 10Andrew Bogott: openstack: Fix flake8 warnings in admin_scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526256 (https://phabricator.wikimedia.org/T229274) (owner: 10BryanDavis) [23:12:48] (03PS5) 10Andrew Bogott: openstack: Fix flake8 warnings in admin_scripts [puppet] - 10https://gerrit.wikimedia.org/r/526256 (https://phabricator.wikimedia.org/T229274) (owner: 10BryanDavis) [23:14:03] (03CR) 10Andrew Bogott: [C: 03+2] openstack: Fix flake8 warnings in admin_scripts [puppet] - 10https://gerrit.wikimedia.org/r/526256 (https://phabricator.wikimedia.org/T229274) (owner: 10BryanDavis) [23:19:10] (03CR) 10Dzahn: "Class[Profile::Mediawiki::Webserver]: has no parameter named 'vhost_feature_flags'" [puppet] - 10https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: 10Giuseppe Lavagetto) [23:22:29] !log replace export policy BGP_Wikimedia_own_space with BGP_Wikimedia_no_dfz in Dallas [23:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:15] !log replace export policy BGP_Wikimedia_own_space with BGP_Wikimedia_no_dfz in ulsfo [23:26:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:23] 10Operations, 10Puppet: Cache some facter facts - https://phabricator.wikimedia.org/T228805 (10Andrew) 05Open→03Resolved [23:35:59] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.15/extensions/GrowthExperiments/: Make welcome and discovery tours fully mutually exclusive (T229044) (duration: 00m 48s) [23:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:06] T229044: Homepage: users should only receive one discovery GuidedTour - https://phabricator.wikimedia.org/T229044 [23:37:20] !log replace export policy BGP_Wikimedia_own_space with BGP_Wikimedia_no_dfz in ams [23:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:29] Urbanecm: oh wow, I didn't think it could cause demotions and other side-effects from new filter violations. That's really strange. If it's "just" AF parse errors, those are ignored and means the filter is ignored. But in this case it seems to have triggered more filters including their consequences/punishments. [23:48:37] That's pretty bad, we need better tests for that for sure. [23:49:54] (03PS2) 10Dzahn: mediawiki: allow installing php7 only [puppet] - 10https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: 10Giuseppe Lavagetto) [23:51:10] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: allow installing php7 only [puppet] - 10https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: 10Giuseppe Lavagetto)