[00:00:00] <James_F>	 "(Cannot access the database: Cannot access the database: No working replica DB server: Unknown error (10.64.48.115))"
[00:00:00] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[00:00:00] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:03] <icinga-wm>	 RECOVERY - Host analytics1044 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[00:00:03] <icinga-wm>	 PROBLEM - Host analytics1076 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:03] <icinga-wm>	 PROBLEM - Host backup1001 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:03] <icinga-wm>	 PROBLEM - Host cp1087 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:03] <icinga-wm>	 PROBLEM - Host cp1088 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:03] <icinga-wm>	 PROBLEM - Host ms-be1043 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:04] <icinga-wm>	 PROBLEM - Host ms-be1048 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:04] <icinga-wm>	 PROBLEM - HHVM rendering on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:05] <icinga-wm>	 PROBLEM - Apache HTTP on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:07] <icinga-wm>	 PROBLEM - HHVM rendering on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:07] <icinga-wm>	 PROBLEM - Apache HTTP on mw1245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:07] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 
[00:00:07] <icinga-wm>	 1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia
[00:00:07] <icinga-wm>	 s/Monitoring/mobileapps
[00:00:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1227 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 937 bytes in 0.610 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:09] <icinga-wm>	 PROBLEM - Apache HTTP on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:09] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:10] <icinga-wm>	 PROBLEM - Restbase root url on restbase1027 is CRITICAL: connect to address 10.64.48.183 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[00:00:10] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-clust
[00:00:11] <icinga-wm>	 method=POST
[00:00:11] <icinga-wm>	 RECOVERY - Host db1093 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[00:00:12] <icinga-wm>	 PROBLEM - HHVM rendering on mw1254 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 871 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:13] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1267 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 914 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:00:15] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:15] <icinga-wm>	 PROBLEM - HHVM rendering on mw1256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:17] <icinga-wm>	 PROBLEM - HHVM rendering on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 937 bytes in 0.538 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:17] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:17] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:19] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:21] <icinga-wm>	 PROBLEM - HHVM rendering on mw1283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 938 bytes in 1.373 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:21] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:00:21] <icinga-wm>	 PROBLEM - HHVM rendering on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:23] <icinga-wm>	 PROBLEM - HHVM rendering on mw1266 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 938 bytes in 1.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:23] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:23] <icinga-wm>	 PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:23] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:24] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1322 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 915 bytes in 0.452 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:25] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition
[00:00:25] <icinga-wm>	 rned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:00:25] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a re
[00:00:25] <icinga-wm>	 ed https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:00:26] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/base (Get base CSS) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200): /{domain}/v1/data/css/mobile/site (Get site-
[00:00:26] <icinga-wm>	 ed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html
[00:00:27] <icinga-wm>	 view mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for
[00:00:27] <icinga-wm>	 eturned the unexpected status 503 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve featured article info for unsupported site (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:00:28] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[00:00:28] <icinga-wm>	 received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:00:29] <icinga-wm>	 PROBLEM - HHVM rendering on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 907 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:29] <icinga-wm>	 PROBLEM - HHVM rendering on mw1242 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 937 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:30] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:30] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a re
[00:00:31] <icinga-wm>	 ed https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:00:31] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:00:32] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1249 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 974 bytes in 8.357 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:32] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpe
[00:00:33] <icinga-wm>	 expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:00:33] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1314 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 883 bytes in 7.820 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:34] <icinga-wm>	 PROBLEM - HHVM rendering on mw1345 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 904 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:34] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.cre
[00:00:35] <icinga-wm>	 good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:00:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:00:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org
[00:00:36] <icinga-wm>	 nitoring/restbase
[00:00:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org
[00:00:37] <icinga-wm>	 nitoring/restbase
[00:00:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:00:38] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1250 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 938 bytes in 7.376 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:39] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:39] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1345 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 886 bytes in 9.006 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:40] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204,205} handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&
[00:00:40] <icinga-wm>	 rver&var-method=GET
[00:00:41] <icinga-wm>	 PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:41] <icinga-wm>	 RECOVERY - Host analytics1043 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[00:00:42] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:42] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:43] <icinga-wm>	 PROBLEM - HHVM rendering on mw1247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:43] <icinga-wm>	 PROBLEM - HHVM rendering on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:44] <icinga-wm>	 PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:44] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:45] <icinga-wm>	 PROBLEM - Host an-worker1093 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:45] <icinga-wm>	 PROBLEM - HHVM rendering on mw1250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:46] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:46] <icinga-wm>	 PROBLEM - HHVM rendering on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:00:47] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:00:48] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:00:51] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:00:51] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:00:53] <icinga-wm>	 PROBLEM - Host cloudelastic1004 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:55] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) i
[00:00:55] <icinga-wm>	 retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test G
[00:00:55] <icinga-wm>	 st page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:00:55] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was receiv
[00:00:55] <icinga-wm>	 article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:00:56] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.
[00:00:56] <icinga-wm>	  - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:00:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/references/{title} (Get references from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:00:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:00:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:00:59] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[00:00:59] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[00:01:01] <icinga-wm>	 PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:01] <icinga-wm>	 PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:01] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:04] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:04] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 592 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:07] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:01:08] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:09] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Moni
[00:01:09] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test art
[00:01:09] <icinga-wm>	 elike - good article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:01:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Mon
[00:01:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:01:13] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 80510 bytes in 4.151 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:15] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/references/{title} (Get references from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a re
[00:01:15] <icinga-wm>	 ed: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[00:01:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:01:15] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:19] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 80509 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:21] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 504 (expecting: 404): /{domain}/v1/caption/translation/from/{s
[00:01:21] <icinga-wm>	 } (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/S
[00:01:21] <icinga-wm>	 g/recommendation_api
[00:01:27] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:01:27] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 504 (expecting: 404): /{domain}/v1/caption/translation/from/{s
[00:01:27] <icinga-wm>	 } (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:01:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:01:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:01:31] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:32] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (re
[00:01:32] <icinga-wm>	 mage data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:01:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:38] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:39] <icinga-wm>	 RECOVERY - HHVM rendering on mw1245 is OK: HTTP OK: HTTP/1.1 200 OK - 80420 bytes in 4.992 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:39] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/feature
[00:01:39] <icinga-wm>	 {day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:01:39] <icinga-wm>	 RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:39] <icinga-wm>	 RECOVERY - HHVM rendering on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:41] <icinga-wm>	 PROBLEM - AQS root url on aqs1005 is CRITICAL: connect to address 10.64.32.138 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[00:01:41] <icinga-wm>	 PROBLEM - Restbase root url on restbase1024 is CRITICAL: connect to address 10.64.16.121 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[00:01:41] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1339 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:43] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:44] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1342 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:44] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1341 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.808 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:44] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:44] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase1024.eqiad.wmnet are marked down but pooled: restbase_7231: Servers restbase1026.eqiad.wmnet are marked down but pooled: restbase-backend_7233: Servers restbase1024.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:01:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:01:45] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 2.251 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:01:47] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:47] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1257 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 6.724 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:47] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 3.994 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:48] <icinga-wm>	 RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:48] <icinga-wm>	 RECOVERY - Apache HTTP on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:49] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:49] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:50] <icinga-wm>	 RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 4.288 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:50] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1251 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:51] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 402 threshold =0.15 breach: timed_out: False, relocating_shards: 0, number_of_in_flight_fetch: 0, status: yellow, active_shards: 1677, delayed_unassigned_shards: 0, number_of_nodes: 3, initializing_shards: 4, number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, active_primary_shards: 693, n
[00:01:51] <icinga-wm>	 es: 3, unassigned_shards: 398, cluster_name: cloudelastic-omega-eqiad, active_shards_percent_as_number: 80.66378066378066 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:01:52] <icinga-wm>	 RECOVERY - HHVM rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 80420 bytes in 1.862 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:52] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 1.931 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:53] <icinga-wm>	 RECOVERY - Apache HTTP on mw1257 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 592 bytes in 2.648 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:53] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:54] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1240 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 2.303 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:54] <icinga-wm>	 RECOVERY - HHVM rendering on mw1251 is OK: HTTP OK: HTTP/1.1 200 OK - 80418 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:01:55] <icinga-wm>	 RECOVERY - Apache HTTP on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:57] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[00:01:57] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 1.793 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:01:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw1256.eqiad.wmnet, mw1242.eqiad.wmnet, mw1249.eqiad.wmnet, mw1243.eqiad.wmnet, mw1257.eqiad.wmnet, mw1244.eqiad.wmnet, mw1254.eqiad.wmnet, mw1252.eqiad.wmnet, mw1270.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1232.eqiad.wmnet, mw1226.eqiad.wmnet, mw1222.eqiad.wmnet, mw1225.eqiad.wmnet, mw1221.eqiad.
[00:01:57] <icinga-wm>	 ad.wmnet, mw1224.eqiad.wmnet, mw1235.eqiad.wmnet, mw1231.eqiad.wmnet are marked down but pooled: appservers-https_443: Servers mw1238.eqiad.wmnet, mw1242.eqiad.wmnet, mw1240.eqiad.wmnet, mw1246.eqiad.wmnet, mw1249.eqiad.wmnet, mw1256.eqiad.wmnet, mw1241.eqiad.wmnet, mw1244.eqiad.wmnet, mw1254.eqiad.wmnet, mw1270.eqiad.wmnet, mw1247.eqiad.wmnet, mw1239.eqiad.wmnet are marked down but pooled: restbase-backend_7233: Servers restbase
[00:01:58] <icinga-wm>	 are marked down but pooled: api_80: Servers mw1232.eqiad.wmnet, mw1229.eqiad.wmnet, mw1226.eqiad.wmnet, mw1225.eqiad.wmnet, mw1223.eqiad.wmnet, mw1231.eqiad.wmnet, mw1230.eqiad.wmnet ar https://wikitech.wikimedia.org/wiki/PyBal
[00:01:58] <icinga-wm>	 RECOVERY - Apache HTTP on mw1245 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:59] <icinga-wm>	 RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:01:59] <icinga-wm>	 RECOVERY - HHVM rendering on mw1240 is OK: HTTP OK: HTTP/1.1 200 OK - 80420 bytes in 2.221 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:00] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:00] <icinga-wm>	 RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 80418 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:03] <icinga-wm>	 RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:03] <icinga-wm>	 RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:03] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1346 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:03] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:04] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.160 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:04] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:04] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.319 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:05] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 1.950 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:05] <icinga-wm>	 RECOVERY - HHVM rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 80420 bytes in 2.440 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:06] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&v
[00:02:06] <icinga-wm>	 pserver&var-method=POST
[00:02:07] <icinga-wm>	 RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 2.090 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:07] <icinga-wm>	 RECOVERY - HHVM rendering on mw1324 is OK: HTTP OK: HTTP/1.1 200 OK - 80418 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:09] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1267 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:09] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:09] <icinga-wm>	 RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 80418 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:09] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:10] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_in_flight_fetch: 0, number_of_nodes: 6, timed_out: False, delayed_unassigned_shards: 0, unassigned_shards: 97, active_shards: 389, active_shards_percent_as_number: 79.38775510204081, active_primary_shards: 213, number_of_data_nodes: 3, status: yellow, initializing_shards: 4, number_of_pend
[00:02:10] <icinga-wm>	 uster_name: production-logstash-eqiad, relocating_shards: 0, task_max_waiting_in_queue_millis: 30669 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:02:11] <icinga-wm>	 RECOVERY - HHVM rendering on mw1254 is OK: HTTP OK: HTTP/1.1 200 OK - 80420 bytes in 2.346 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:11] <icinga-wm>	 RECOVERY - HHVM rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 80418 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:13] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:13] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:02:14] <icinga-wm>	 RECOVERY - HHVM rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 80418 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:14] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1249 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:15] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:02:17] <icinga-wm>	 PROBLEM - Restbase root url on restbase1022 is CRITICAL: connect to address 10.64.16.113 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[00:02:17] <icinga-wm>	 RECOVERY - HHVM rendering on mw1266 is OK: HTTP OK: HTTP/1.1 200 OK - 80418 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:17] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1314 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:17] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:17] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1322 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:18] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:02:19] <icinga-wm>	 RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 80416 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:19] <icinga-wm>	 RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 80418 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:20] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 384 threshold =0.15 breach: timed_out: False, task_max_waiting_in_queue_millis: 0, initializing_shards: 4, relocating_shards: 0, delayed_unassigned_shards: 0, number_of_data_nodes: 3, cluster_name: cloudelastic-omega-eqiad, number_of_nodes: 3, number_of_in_flight_fetch: 0, active_primary_shards: 693, n
[00:02:20] <icinga-wm>	 tasks: 1, active_shards_percent_as_number: 81.52958152958153, active_shards: 1695, status: yellow, unassigned_shards: 380 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:02:21] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 354 threshold =0.15 breach: active_shards: 1764, unassigned_shards: 350, number_of_nodes: 3, timed_out: False, task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, status: yellow, number_of_data_nodes: 3, active_shards_percent_as_number: 83.28611898016997, number_of_in_flight_fetch: 0, cl
[00:02:21] <icinga-wm>	 elastic-psi-eqiad, active_primary_shards: 706, relocating_shards: 0, initializing_shards: 4, number_of_pending_tasks: 1 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:02:22] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:22] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1250 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 1.969 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:23] <icinga-wm>	 PROBLEM - Restbase root url on restbase1016 is CRITICAL: connect to address 10.64.0.31 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[00:02:23] <icinga-wm>	 RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 80420 bytes in 2.366 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:24] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 1.960 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:24] <icinga-wm>	 RECOVERY - HHVM rendering on mw1247 is OK: HTTP OK: HTTP/1.1 200 OK - 80418 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:25] <icinga-wm>	 RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 80418 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:25] <icinga-wm>	 RECOVERY - HHVM rendering on mw1345 is OK: HTTP OK: HTTP/1.1 200 OK - 80418 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:26] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 381 threshold =0.15 breach: number_of_data_nodes: 3, active_primary_shards: 693, status: yellow, cluster_name: cloudelastic-omega-eqiad, relocating_shards: 0, timed_out: False, number_of_in_flight_fetch: 0, number_of_nodes: 3, unassigned_shards: 377, number_of_pending_tasks: 0, delayed_unassigned_shard
[00:02:26] <icinga-wm>	 ds_percent_as_number: 81.67388167388168, active_shards: 1698, initializing_shards: 4, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:02:27] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:27] <icinga-wm>	 RECOVERY - HHVM rendering on mw1257 is OK: HTTP OK: HTTP/1.1 200 OK - 80420 bytes in 2.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:28] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 2.207 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:28] <icinga-wm>	 RECOVERY - HHVM rendering on mw1250 is OK: HTTP OK: HTTP/1.1 200 OK - 80420 bytes in 2.012 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:29] <icinga-wm>	 RECOVERY - HHVM rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 80418 bytes in 0.165 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:33] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:33] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 80458 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:02:39] <icinga-wm>	 RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:02:41] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:43] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 80459 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:43] <icinga-wm>	 RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 80461 bytes in 1.885 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:43] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 341 threshold =0.15 breach: status: yellow, task_max_waiting_in_queue_millis: 0, number_of_nodes: 3, number_of_data_nodes: 3, initializing_shards: 4, timed_out: False, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, active_shards_percent_as_number: 83.89990557129367, unassigned_shards: 337, r
[00:02:43] <icinga-wm>	  0, delayed_unassigned_shards: 0, cluster_name: cloudelastic-psi-eqiad, active_primary_shards: 706, active_shards: 1777 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:02:44] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:02:51] <icinga-wm>	 PROBLEM - Host flerovium is DOWN: PING CRITICAL - Packet loss = 100%
[00:02:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:02:53] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:02:55] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:02:57] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[00:02:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:02:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:03:01] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 334 threshold =0.15 breach: timed_out: False, cluster_name: cloudelastic-psi-eqiad, active_shards_percent_as_number: 84.2304060434372, unassigned_shards: 330, number_of_nodes: 3, active_primary_shards: 706, number_of_in_flight_fetch: 0, number_of_pending_tasks: 1, initializing_shards: 4, task_max_waiti
[00:03:01] <icinga-wm>	 s: 0, relocating_shards: 0, status: yellow, active_shards: 1784, delayed_unassigned_shards: 0, number_of_data_nodes: 3 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:03:03] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:03:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:03:09] <icinga-wm>	 RECOVERY - Apache HTTP on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:03:09] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:03:11] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:03:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/announcemen
[00:03:15] <icinga-wm>	 uncements) is CRITICAL: Test Retrieve announcements returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:03:21] <icinga-wm>	 RECOVERY - Restbase root url on restbase1024 is OK: HTTP OK: HTTP/1.1 200 - 16254 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[00:03:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:03:27] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 359 threshold =0.15 breach: delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, relocating_shards: 0, active_primary_shards: 733, active_shards_percent_as_number: 75.51159618008185, initializing_shards: 12, number_of_pending_tasks: 0, cluster_name: cloudelastic-chi-eqiad, number_of_nodes: 3, nu
[00:03:27] <icinga-wm>	 s: 3, active_shards: 1107, status: yellow, timed_out: False, unassigned_shards: 347, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:03:27] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 359 threshold =0.15 breach: active_shards_percent_as_number: 75.51159618008185, number_of_nodes: 3, initializing_shards: 12, unassigned_shards: 347, timed_out: False, relocating_shards: 0, status: yellow, task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, active_primary_shards: 733, activ
[00:03:27] <icinga-wm>	 luster_name: cloudelastic-chi-eqiad, number_of_data_nodes: 3, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:03:43] <icinga-wm>	 RECOVERY - Restbase root url on restbase1027 is OK: HTTP OK: HTTP/1.1 200 - 16254 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[00:03:43] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:03:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:03:50] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[00:03:53] <icinga-wm>	 RECOVERY - Restbase root url on restbase1022 is OK: HTTP OK: HTTP/1.1 200 - 16254 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[00:03:57] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, number_of_in_flight_fetch: 0, number_of_pending_tasks: 2, unassigned_shards: 311, number_of_nodes: 3, delayed_unassigned_shards: 0, status: yellow, number_of_data_nodes: 3, relocating_shards: 0, initializing_shards: 4, timed_out: False, task_max_waiting_in_queu
[00:03:57] <icinga-wm>	 ctive_shards_percent_as_number: 85.12747875354107, active_primary_shards: 706, active_shards: 1803 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:03:57] <icinga-wm>	 RECOVERY - Restbase root url on restbase1016 is OK: HTTP OK: HTTP/1.1 200 - 16254 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[00:04:01] <icinga-wm>	 PROBLEM - MediaWiki codfw memcached error rate on icinga1001 is CRITICAL: 6154 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[00:04:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:04:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:04:05] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 357 threshold =0.15 breach: active_shards_percent_as_number: 75.64802182810368, timed_out: False, number_of_pending_tasks: 3, active_primary_shards: 733, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, cluster_name: cloudelastic-chi-eqiad, initializing_shards: 12, unassigned_shards: 345, nu
[00:04:05] <icinga-wm>	  number_of_data_nodes: 3, task_max_waiting_in_queue_millis: 7711, active_shards: 1109, relocating_shards: 0, status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:04:15] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2006:9536,cp2007:9536,cp2010:9536,cp2012:9536,cp2013:9536,cp2016:9536,cp2019:9536,cp2023:9536} site=codfw tunnel={cp1087_v4,cp1087_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[00:04:17] <icinga-wm>	 PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.6409 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[00:04:19] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:04:21] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: status: yellow, task_max_waiting_in_queue_millis: 0, number_of_in_flight_fetch: 0, unassigned_shards: 305, number_of_nodes: 3, active_primary_shards: 706, number_of_data_nodes: 3, delayed_unassigned_shards: 0, timed_out: False, initializing_shards: 4, relocating_shards: 0, number_of_pending_tasks: 
[00:04:21] <icinga-wm>	 cloudelastic-psi-eqiad, active_shards: 1809, active_shards_percent_as_number: 85.41076487252126 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:04:23] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1229.eqiad.wmnet, mw1256.eqiad.wmnet, mw1232.eqiad.wmnet, mw1242.eqiad.wmnet, mw1240.eqiad.wmnet, mw1246.eqiad.wmnet, mw1226.eqiad.wmnet, mw1249.eqiad.wmnet, mw1222.eqiad.wmnet, mw1243.eqiad.wmnet, mw1225.eqiad.wmnet, mw1223.eqiad.wmnet, mw1241.eqiad.wmnet, mw1221.eqiad.wmnet, mw1257.eqiad.wmnet, mw1244.eqiad.wmnet, mw1238
[00:04:23] <icinga-wm>	 234.eqiad.wmnet, mw1224.eqiad.wmnet, mw1235.eqiad.wmnet, mw1231.eqiad.wmnet, mw1254.eqiad.wmnet, mw1252.eqiad.wmnet, mw1247.eqiad.wmnet, mw1239.eqiad.wmnet, mw1230.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[00:04:23] <icinga-wm>	 PROBLEM - AQS root url on aqs1009 is CRITICAL: connect to address 10.64.48.119 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[00:04:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:04:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:04:27] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[00:04:27] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[00:04:35] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[00:04:37] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[00:04:37] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status esams on icinga1001 is CRITICAL: instance={cp3030:9536,cp3032:9536,cp3033:9536,cp3040:9536,cp3041:9536,cp3042:9536,cp3043:9536} site=esams tunnel={cp1087_v4,cp1087_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[00:04:41] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: unassigned_shards: 295, cluster_name: cloudelastic-psi-eqiad, initializing_shards: 4, active_shards: 1819, active_shards_percent_as_number: 85.88290840415486, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, timed_out: False, task_max_waiting_in_queue_millis: 0, status: yellow, number_of
[00:04:41] <icinga-wm>	 elocating_shards: 0, active_primary_shards: 706, number_of_nodes: 3, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:04:47] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status eqsin on icinga1001 is CRITICAL: instance={cp5007:9536,cp5008:9536,cp5009:9536,cp5010:9536,cp5011:9536,cp5012:9536} site=eqsin tunnel={cp1087_v4,cp1087_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[00:04:49] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[00:04:49] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:04:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:04:53] <icinga-wm>	 PROBLEM - AQS root url on aqs1006 is CRITICAL: connect to address 10.64.48.146 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[00:05:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:05:07] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [utfa] https://wikitech.wikimedia.org/wiki/RESTBase
[00:05:13] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[00:05:21] <icinga-wm>	 PROBLEM - MediaWiki eqiad memcached error rate on icinga1001 is CRITICAL: 1.074e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:05:21] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[00:05:23] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[00:05:35] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: number_of_nodes: 3, number_of_data_nodes: 3, active_primary_shards: 693, status: yellow, number_of_pending_tasks: 1, unassigned_shards: 302, active_shards_percent_as_number: 85.28138528138528, timed_out: False, relocating_shards: 0, active_shards: 1773, task_max_waiting_in_queue_millis: 0, initia
[00:05:35] <icinga-wm>	  delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, cluster_name: cloudelastic-omega-eqiad https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:05:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:05:37] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: active_shards_percent_as_number: 85.37758537758538, number_of_nodes: 3, task_max_waiting_in_queue_millis: 0, timed_out: False, number_of_in_flight_fetch: 0, number_of_data_nodes: 3, active_shards: 1775, status: yellow, cluster_name: cloudelastic-omega-eqiad, initializing_shards: 4, active_primary
[00:05:37] <icinga-wm>	 ayed_unassigned_shards: 0, unassigned_shards: 300, relocating_shards: 0, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:05:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:05:47] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status ulsfo on icinga1001 is CRITICAL: instance={cp4027:9536,cp4028:9536,cp4029:9536,cp4030:9536,cp4031:9536,cp4032:9536} site=ulsfo tunnel={cp1087_v4,cp1087_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[00:05:49] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[00:05:51] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:05:53] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:05:53] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:05:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:05:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:05:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:06:01] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:06:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:06:09] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[00:06:09] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[00:06:09] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [utfa] https://wikitech.wikimedia.org/wiki/RESTBase
[00:06:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:06:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:06:19] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[00:06:19] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:06:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:06:41] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: number_of_data_nodes: 3, unassigned_shards: 263, initializing_shards: 4, active_shards_percent_as_number: 87.15728715728716, number_of_pending_tasks: 0, status: yellow, number_of_in_flight_fetch: 0, active_primary_shards: 693, active_shards: 1812, relocating_shards: 0, number_of_nodes: 3, timed_o
[00:06:41] <icinga-wm>	 r_name: cloudelastic-omega-eqiad, task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:06:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:07:07] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpect
[00:07:07] <icinga-wm>	 pecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:07:15] <icinga-wm>	 PROBLEM - MediaWiki codfw memcached error rate on icinga1001 is CRITICAL: 9580 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[00:07:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:07:17] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[00:07:19] <icinga-wm>	 PROBLEM - IPsec on cp5011 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:07:33] <icinga-wm>	 PROBLEM - IPsec on cp4030 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:07:37] <icinga-wm>	 PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:07:37] <icinga-wm>	 PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:07:37] <icinga-wm>	 PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:07:37] <icinga-wm>	 PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:07:45] <icinga-wm>	 PROBLEM - IPsec on cp5007 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:07:49] <icinga-wm>	 PROBLEM - IPsec on cp5012 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:07:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:07:53] <icinga-wm>	 PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:07:54] <icinga-wm>	 PROBLEM - IPsec on cp5010 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:07:54] <icinga-wm>	 PROBLEM - IPsec on cp5009 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:03] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:08:09] <icinga-wm>	 PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:09] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:08:13] <icinga-wm>	 PROBLEM - IPsec on cp4032 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:14] <icinga-wm>	 PROBLEM - IPsec on cp4031 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:14] <icinga-wm>	 PROBLEM - IPsec on cp4028 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:14] <icinga-wm>	 PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:23] <icinga-wm>	 PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:25] <icinga-wm>	 PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:25] <icinga-wm>	 PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:25] <icinga-wm>	 PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:27] <icinga-wm>	 PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:27] <icinga-wm>	 PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:29] <icinga-wm>	 PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a r
[00:08:29] <icinga-wm>	 ved https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[00:08:31] <icinga-wm>	 PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:40] <icinga-wm>	 PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:51] <icinga-wm>	 PROBLEM - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:53] <icinga-wm>	 PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:55] <icinga-wm>	 PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:08:57] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:08:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:09:01] <icinga-wm>	 PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:09:05] <icinga-wm>	 PROBLEM - IPsec on cp5008 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1087_v4, cp1087_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[00:09:11] <icinga-wm>	 RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[00:09:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:09:27] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[00:09:29] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:10:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:10:03] <icinga-wm>	 RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[00:10:33] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:10:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:10:59] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:10:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:11:13] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:11:21] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:11:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:11:39] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [uimage, utfa] https://wikitech.wikimedia.org/wiki/RESTBase
[00:11:43] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [uimage] https://wikitech.wikimedia.org/wiki/RESTBase
[00:12:07] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on labstore1007 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad+prometheus/ops
[00:12:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) is CRITICAL: Test Get mobile-html from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:12:19] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:12:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:12:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:12:55] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[00:12:57] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:12:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:13:19] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[00:13:27] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:13:43] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:14:01] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic={rsyslog-err,rsyslog-info,rsyslog-notice,rsyslog-warning,udp_localhost-err,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimed
[00:14:01] <icinga-wm>	 4/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[00:14:07] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:14:13] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobile
[00:14:19] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [uimage] https://wikitech.wikimedia.org/wiki/RESTBase
[00:14:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:14:27] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:14:31] <icinga-wm>	 RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[00:14:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:15:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:15:21] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:15:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:15:45] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:15:57] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) 
[00:15:57] <icinga-wm>	 wikimedia.org/wiki/RESTBase
[00:16:09] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:16:51] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:17:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:17:10] <librenms-wmf>	 Critical Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Juniper alarm active
[00:18:03] <icinga-wm>	 RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[00:18:21] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is C
[00:18:21] <icinga-wm>	 rieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:19:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:19:07] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[00:19:07] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:19:17] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:19:27] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 20
[00:19:27] <icinga-wm>	 page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:20:05] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 50
[00:20:05] <icinga-wm>	 ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:20:11] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:20:27] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/random/titl
[00:20:27] <icinga-wm>	 dom article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:20:27] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[00:20:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:20:33] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpect
[00:20:33] <icinga-wm>	 pecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:20:43] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[00:20:47] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[00:20:53] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:20:57] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:21:01] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the
[00:21:01] <icinga-wm>	 s 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:21:47] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:21:47] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/
[00:21:47] <icinga-wm>	 article.creation.morelike - bad article title) is CRITICAL: Test article.creation.morelike - bad article title returned the unexpected status 504 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:21:57] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:22:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Mon
[00:22:11] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[00:22:21] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:22:25] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:22:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:22:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia
[00:22:41] <icinga-wm>	 s/Monitoring/restbase
[00:23:29] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[00:23:37] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[00:23:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:23:41] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 504 (expecting: 200) https://wiki
[00:23:41] <icinga-wm>	 g/wiki/Services/Monitoring/recommendation_api
[00:23:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:24:07] <icinga-wm>	 PROBLEM - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:24:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:24:53] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:25:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:25:15] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:25:15] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:25:21] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[00:25:31] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:25:41] <icinga-wm>	 RECOVERY - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:25:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:25:47] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:25:55] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:26:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:26:33] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_primary_shards: 733, timed_out: False, number_of_data_nodes: 3, task_max_waiting_in_queue_millis: 0, number_of_in_flight_fetch: 0, cluster_name: cloudelastic-chi-eqiad, status: yellow, number_of_nodes: 3, active_shards_percent_as_number: 85.19781718963165, active_shards: 1249, number_of_pend
[00:26:33] <icinga-wm>	 ssigned_shards: 205, relocating_shards: 0, delayed_unassigned_shards: 0, initializing_shards: 12 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:26:33] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:26:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:26:39] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[00:27:05] <icinga-wm>	 PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[00:27:11] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:27:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:27:17] <wikibugs>	 (PS1) Bstorm: labstore: fail over labstore1007 and cloudstore1008 for network issue [puppet] - https://gerrit.wikimedia.org/r/538351
[00:27:33] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: number_of_nodes: 3, cluster_name: cloudelastic-chi-eqiad, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.67530695770806, number_of_in_flight_fetch: 0, unassigned_shards: 198, timed_out: False, initializing_shards: 12, relocating_shards: 0, status: yellow, delayed_unassigne
[00:27:33] <icinga-wm>	 er_of_data_nodes: 3, active_shards: 1256, active_primary_shards: 733, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:27:33] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_shards: 1256, number_of_nodes: 3, relocating_shards: 0, active_primary_shards: 733, status: yellow, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, initializing_shards: 12, cluster_name: cloudelastic-chi-eqiad, number_of_pending_tasks: 0, time
[00:27:33] <icinga-wm>	 ive_shards_percent_as_number: 85.67530695770806, unassigned_shards: 198, number_of_data_nodes: 3 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:28:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:28:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:28:51] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:29:43] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:29:43] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:29:43] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:30:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:30:13] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:30:23] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [utfa] https://wikitech.wikimedia.org/wiki/RESTBase
[00:30:29] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:30:29] <icinga-wm>	 PROBLEM - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:30:34] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:30:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [uimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:31:07] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:31:19] <icinga-wm>	 PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[00:31:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:31:25] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:31:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:31:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:31:53] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:32:07] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:32:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:32:51] <icinga-wm>	 RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[00:32:57] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:33:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:33:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:33:39] <icinga-wm>	 PROBLEM - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:33:41] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} 
[00:33:41] <icinga-wm>	 .morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:33:55] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobile
[00:34:23] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[00:34:27] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:34:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:34:33] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:34:45] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:34:50] <wikibugs>	 (CR) Bstorm: [C: +2] labstore: fail over labstore1007 and cloudstore1008 for network issue [puppet] - https://gerrit.wikimedia.org/r/538351 (owner: Bstorm)
[00:34:57] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:35:19] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 50
[00:35:19] <icinga-wm>	 ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:35:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:35:23] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:35:25] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:35:37] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[00:35:51] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/fe
[00:35:51] <icinga-wm>	 nth}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:36:01] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[00:36:13] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[00:36:13] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:37:11] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[00:38:35] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp1087.eqiad.wmnet
[00:38:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:43] <bblack>	 !log depooling confctl things in rack D2
[00:38:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:45] <stashbot>	 D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2
[00:38:53] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[00:39:00] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp1088.eqiad.wmnet
[00:39:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:32] <razesoldier>	 Does anyone know what caused the 503 error that just happened?
[00:39:40] <bblack>	 see topic!
[00:39:43] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[00:39:51] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:39:57] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:40:05] <icinga-wm>	 PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[00:40:29] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[00:40:57] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:40:57] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:41:17] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:41:27] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:41:33] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[00:41:35] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:42:33] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 504 
[00:42:34] <icinga-wm>	  /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:42:35] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[00:42:57] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:43:59] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[00:44:11] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:44:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:44:43] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:45:43] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:45:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:46:11] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 504 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:46:21] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 50
[00:46:21] <icinga-wm>	 ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:46:27] <shdubsh>	 !log restarting logstash on logstash1008 without udp-localhost-eqiad/codfw configs
[00:46:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:29] <icinga-wm>	 RECOVERY - AQS root url on aqs1005 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[00:46:29] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returne
[00:46:29] <icinga-wm>	 status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) is CRITICAL: Test article.creation.morelike - bad article title returned the unexpected status 504 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:46:35] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:47:04] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:47:23] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[00:47:25] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:47:43] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:48:13] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:48:16] <mutante>	 !log aqs1005 - systemctl restart aqs
[00:48:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:29] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[00:48:47] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[00:48:59] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[00:49:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Mon
[00:49:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:49:31] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[00:50:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:50:41] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:51:01] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[00:51:41] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[00:51:53] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 20
[00:51:53] <icinga-wm>	 ch.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:52:11] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:52:13] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:52:37] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[00:52:37] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[00:52:43] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:52:51] <icinga-wm>	 RECOVERY - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:52:55] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:53:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[00:53:03] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:53:15] <icinga-wm>	 RECOVERY - AQS root url on aqs1006 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[00:53:57] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article t
[00:53:57] <icinga-wm>	  unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:54:09] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returne
[00:54:09] <icinga-wm>	 status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:54:13] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 504 (expecting: 404): /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected
[00:54:13] <icinga-wm>	 cting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:54:17] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:54:22] <mutante>	 !log aqs1006 - systemctl restart aqs
[00:54:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:25] <mutante>	 !log aqs1009 - systemctl restart aqs
[00:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:51] <icinga-wm>	 RECOVERY - AQS root url on aqs1009 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[00:55:03] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[00:56:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:56:35] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[00:56:41] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:56:59] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:57:33] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 50
[00:57:33] <icinga-wm>	 ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:57:35] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returne
[00:57:35] <icinga-wm>	 status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:57:47] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200) https://wikitech
[00:57:47] <icinga-wm>	 ki/Mobileapps_%28service%29
[00:58:05] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[00:58:59] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:59:05] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:59:23] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:59:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:59:37] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:59:41] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[01:00:11] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:00:13] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:00:17] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:00:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[01:01:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:01:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:01:25] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[01:01:29] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:01:49] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) r
[01:01:49] <icinga-wm>	 ected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:02:11] <icinga-wm>	 PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[01:02:13] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [uimage] https://wikitech.wikimedia.org/wiki/RESTBase
[01:02:23] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[01:02:39] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:02:41] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:03:01] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[01:03:30] <wikibugs>	 10Operations, 10Performance-Team, 10observability, 10serviceops: Ensure graphs used by Performance account for Varnish-to-ATS migration - https://phabricator.wikimedia.org/T233474 (10Krinkle)
[01:03:45] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[01:03:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:03:51] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[01:04:03] <wikibugs>	 10Operations, 10Performance-Team, 10observability, 10serviceops: Ensure graphs used by Performance account for Varnish-to-ATS migration - https://phabricator.wikimedia.org/T233474 (10Krinkle) I've added the ones I use most frequently. I'm probably missing others. To be added after Monday's meeting.
[01:04:21] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expec
[01:04:21] <icinga-wm>	 ain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:05:41] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:05:47] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:05:47] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:06:39] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:06:43] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:06:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:07:03] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/
[01:07:04] <icinga-wm>	 itoring/recommendation_api
[01:07:11] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article t
[01:07:11] <icinga-wm>	  unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:07:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:07:27] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:07:33] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[01:07:47] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [uimage] https://wikitech.wikimedia.org/wiki/RESTBase
[01:08:23] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:08:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:37] <shdubsh>	 !log removed input-kafka-rsyslog-shipper-eqiad/codfw from logstash inputs logstash1008 and logstash1009
[01:08:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:51] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [uimage] https://wikitech.wikimedia.org/wiki/RESTBase
[01:09:09] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[01:09:15] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[01:09:53] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article t
[01:09:53] <icinga-wm>	  unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:10:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:10:15] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:10:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:10:35] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:10:35] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:10:51] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[01:11:21] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:11:23] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpect
[01:11:23] <icinga-wm>	 pecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:11:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:11:49] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[01:11:59] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:12:01] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[01:12:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:04] <wikibugs>	 (03PS4) 10Jforrester: [WIP] Provide for YAML-based inherited configuration to eventually replace InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538129
[01:13:06] <wikibugs>	 (03PS1) 10Jforrester: [WiP] YAML files for every wiki, and a basic inheritance tree [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538354
[01:13:13] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/
[01:13:13] <icinga-wm>	 apps
[01:13:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpe
[01:13:14] <icinga-wm>	 expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:47] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:13:53] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:13:55] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[01:13:57] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[01:14:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/S
[01:14:09] <icinga-wm>	 g/restbase
[01:14:21] <shdubsh>	 !log temporarily removing input-kafka-rsyslog-shipper-eqiad/codfw from logstash1007
[01:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:37] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:15:01] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:15:13] <icinga-wm>	 PROBLEM - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:15:15] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:15:17] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[01:15:23] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:15:31] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:15:33] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[01:15:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:16:17] <musikanimal>	 hello! I know it's late and on Friday, but has anyone had a look at https://phabricator.wikimedia.org/T233271 ?
[01:16:39] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[01:16:39] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[01:16:43] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:16:53] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[01:17:03] <musikanimal>	 I don't know how to use Grafana well, I wonder if there is a spike in 503s. I'm now getting a bunch of automated error emails from XTools, it's getting 503s from the action API and even the pageviews API
[01:17:05] <bblack>	 musikanimal: see topic - ongoing minor internal network issues with some minimal fallout (some clusters are not handling automatic redundancy like they should)
[01:17:11] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[01:17:19] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[01:17:29] <bblack>	 we're working on the physical fix, and I'm trying to dig a little on why this is causing unexpected higher-level impacts
[01:17:30] <musikanimal>	 oh thanks bblack. There is https://phabricator.wikimedia.org/T233271 by the way, not sure if there's another task
[01:17:31] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[01:17:43] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 261, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:17:49] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:17:53] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: T
[01:17:54] <icinga-wm>	 ured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:17:57] <icinga-wm>	 PROBLEM - configured eth on lvs1013 is CRITICAL: enp5s0f1 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[01:18:15] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[01:18:21] <icinga-wm>	 PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[01:18:29] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw2-d-eqiad is CRITICAL: CRIT: Down: 3 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[01:18:29] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:18:49] <wikibugs>	 10Operations, 10User-DannyS712: 503 Backend fetch failed - https://phabricator.wikimedia.org/T233271 (10BBlack) We have a known ongoing incident with a bad network switch in our eqiad (Virginia) datacenter.  We're working on fixing that root problem.  In theory all affected clusters should've had the redundanc...
[01:18:55] <icinga-wm>	 PROBLEM - configured eth on lvs1015 is CRITICAL: enp5s0f1 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[01:19:09] <icinga-wm>	 PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[01:19:31] <icinga-wm>	 RECOVERY - configured eth on lvs1013 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[01:19:31] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:19:35] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:19:35] <icinga-wm>	 RECOVERY - Host cloudelastic1004 is UP: PING OK - Packet loss = 16%, RTA = 0.23 ms
[01:19:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:19:37] <icinga-wm>	 RECOVERY - Host labstore1007 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[01:19:37] <icinga-wm>	 RECOVERY - Host cp1087 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[01:19:37] <icinga-wm>	 RECOVERY - IPsec on cp5011 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:19:39] <icinga-wm>	 RECOVERY - Host cp1088 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[01:19:39] <icinga-wm>	 RECOVERY - Host ms-be1048 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[01:19:39] <icinga-wm>	 RECOVERY - Host analytics1076 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[01:19:39] <icinga-wm>	 RECOVERY - IPsec on cp5008 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:19:41] <icinga-wm>	 RECOVERY - Host ms-be1043 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[01:19:41] <icinga-wm>	 RECOVERY - Host an-worker1093 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[01:19:41] <icinga-wm>	 RECOVERY - Host backup1001 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[01:19:41] <icinga-wm>	 RECOVERY - IPsec on cp4030 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:19:45] <icinga-wm>	 RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:19:45] <icinga-wm>	 RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:19:45] <icinga-wm>	 RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:19:45] <icinga-wm>	 RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:19:51] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:19:51] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:19:57] <icinga-wm>	 RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[01:19:57] <icinga-wm>	 RECOVERY - Host an-worker1092 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[01:19:59] <icinga-wm>	 RECOVERY - IPsec on cp5007 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:01] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw2-d-eqiad is OK: OK: UP: 22 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[01:20:01] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:20:01] <icinga-wm>	 RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:03] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:20:03] <icinga-wm>	 RECOVERY - IPsec on cp5012 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:07] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:20:09] <icinga-wm>	 RECOVERY - IPsec on cp5010 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:09] <icinga-wm>	 RECOVERY - IPsec on cp5009 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:15] <icinga-wm>	 RECOVERY - Host flerovium is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[01:20:19] <icinga-wm>	 RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:21] <icinga-wm>	 RECOVERY - IPsec on cp4032 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:23] <icinga-wm>	 RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:24] <icinga-wm>	 RECOVERY - IPsec on cp4031 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:24] <icinga-wm>	 RECOVERY - IPsec on cp4029 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:24] <icinga-wm>	 RECOVERY - IPsec on cp4028 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:24] <icinga-wm>	 RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:25] <icinga-wm>	 RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:25] <icinga-wm>	 RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:27] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:20:27] <icinga-wm>	 RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:33] <icinga-wm>	 RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:33] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[01:20:34] <icinga-wm>	 RECOVERY - configured eth on lvs1015 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[01:20:39] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[01:20:41] <icinga-wm>	 RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:41] <icinga-wm>	 RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:43] <icinga-wm>	 RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[01:20:53] <icinga-wm>	 RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:54] <icinga-wm>	 RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:20:59] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:20:59] <icinga-wm>	 RECOVERY - IPsec on cp4027 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:21:01] <icinga-wm>	 RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 52 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[01:21:01] <bblack>	 !log re-pooling cp108[78] in D2 via confctl
[01:21:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:04] <stashbot>	 D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2
[01:21:05] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:21:05] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[01:21:07] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[01:21:09] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:21:15] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp1088.eqiad.wmnet
[01:21:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:21] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp1087.eqiad.wmnet
[01:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:33] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[01:21:51] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[01:22:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[01:22:09] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:22:41] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[01:23:03] <icinga-wm>	 RECOVERY - MediaWiki codfw exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[01:23:17] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:23:37] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[01:23:53] <icinga-wm>	 RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[01:24:57] <icinga-wm>	 RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[01:25:05] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:25:11] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[01:25:51] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:27:49] <icinga-wm>	 PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[01:29:41] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:30:07] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:30:59] <icinga-wm>	 RECOVERY - MediaWiki codfw exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[01:31:29] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:31:33] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[01:31:43] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:32:11] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:32:13] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:32:15] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:33:07] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[01:33:45] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:33:49] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:33:49] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:33:49] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:34:13] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:34:21] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retri
[01:34:21] <icinga-wm>	 featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:34:27] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for tes
[01:34:27] <icinga-wm>	 he unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:34:29] <mutante>	 !log restarted mobileapps service on scb1001
[01:34:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:45] <mutante>	 !log restarting mobileapps service on scb* 
[01:34:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:09] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:35:21] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:35:25] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:35:53] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:35:53] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:35:53] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:37:53] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[01:38:55] <icinga-wm>	 PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[01:39:29] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[01:39:37] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[01:40:29] <icinga-wm>	 RECOVERY - MediaWiki codfw exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[01:41:11] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[01:45:11] <icinga-wm>	 PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[01:47:27] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[01:48:21] <icinga-wm>	 PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[01:48:31] <icinga-wm>	 ACKNOWLEDGEMENT - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 347.2 ge 210 cole_white logstash has a huge backlog to work through https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[01:48:31] <icinga-wm>	 ACKNOWLEDGEMENT - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw cole_white logstash has a huge backlog to work through https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[01:48:31] <icinga-wm>	 ACKNOWLEDGEMENT - MediaWiki codfw memcached error rate on icinga1001 is CRITICAL: 3.96e+04 gt 5000 cole_white logstash has a huge backlog to work through https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[01:48:31] <icinga-wm>	 ACKNOWLEDGEMENT - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad cole_white logstash has a huge backlog to work through https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:48:31] <icinga-wm>	 ACKNOWLEDGEMENT - MediaWiki eqiad memcached error rate on icinga1001 is CRITICAL: 1.004e+05 gt 5000 cole_white logstash has a huge backlog to work through https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:48:32] <icinga-wm>	 ACKNOWLEDGEMENT - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic={rsyslog-err,rsyslog-info,rsyslog-notice,rsyslog-warning,udp_localhost-err,udp_localhost-info,udp_localhost-warning} cole_white logstash has a huge backlog to work through https://wikitech.wikimedia
[01:48:32] <icinga-wm>	 h%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[01:49:01] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[01:51:10] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l Device asw2-d-eqiad.mgmt.eqiad.wmnet recovered from Juniper alarm active
[01:52:41] <shdubsh>	 !log temporarily removing input-kafka-rsyslog-shipper-eqiad/codfw from logstash2004-5-6
[01:52:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:55:01] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on labstore1007 is CRITICAL: cluster=wmcs device={cciss,14,cciss,15,cciss,16,cciss,17,cciss,18,cciss,19,cciss,20,cciss,21,cciss,22,cciss,23} instance=labstore1007:9100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad+prometheus/ops
[02:01:34] <icinga-wm>	 ACKNOWLEDGEMENT - Device not healthy -SMART- on labstore1007 is CRITICAL: cluster=wmcs device={cciss,14,cciss,15,cciss,16,cciss,17,cciss,18,cciss,19,cciss,20,cciss,21,cciss,22,cciss,23} instance=labstore1007:9100 job=node site=eqiad Brandon Black https://phabricator.wikimedia.org/T199248 - I think we lost an existing ACK here from network troubles this evening https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wik
[02:01:34] <icinga-wm>	 ard/db/host-overview?var-server=labstore1007&var-datasource=eqiad+prometheus/ops
[02:11:15] <wikibugs>	 (CR) Cwhite: initial commit (6 comments) [debs/prometheus-swagger-exporter] - https://gerrit.wikimedia.org/r/536376 (owner: Cwhite)
[02:14:02] <bblack>	 !log dbproxy1016: executing "systemctl reload haproxy" to recover from false healthcheck failure (network issues) on master
[02:14:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:59] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[02:15:48] <bblack>	 !log dbproxy1017: executing "systemctl reload haproxy" to recover from false healthcheck failure (network issues) on master
[02:15:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:23] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[02:41:27] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 46703896 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:43:01] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4648 and 65 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:54:45] <icinga-wm>	 RECOVERY - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:07:54] <icinga-wm>	 RECOVERY - MediaWiki eqiad memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 13 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:18:57] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:40:31] <icinga-wm>	 RECOVERY - MediaWiki codfw exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[04:42:46] <wikibugs>	 (PS12) Jeena Huneidi: Add restbase chart (port from local-charts) [deployment-charts] - https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T224935)
[04:57:53] <icinga-wm>	 PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[05:01:03] <icinga-wm>	 RECOVERY - MediaWiki codfw exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[05:06:41] <icinga-wm>	 RECOVERY - MediaWiki codfw memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 7 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[05:11:13] <icinga-wm>	 PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[05:11:27] <icinga-wm>	 PROBLEM - MediaWiki codfw memcached error rate on icinga1001 is CRITICAL: 1.431e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[05:11:47] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[05:12:49] <icinga-wm>	 RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash
[05:13:21] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[05:20:53] <icinga-wm>	 PROBLEM - MediaWiki codfw memcached error rate on icinga1001 is CRITICAL: 1.326e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[05:24:39] <icinga-wm>	 PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash
[05:25:37] <icinga-wm>	 PROBLEM - MediaWiki codfw memcached error rate on icinga1001 is CRITICAL: 7784 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[05:26:13] <icinga-wm>	 RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash
[05:31:53] <icinga-wm>	 RECOVERY - MediaWiki codfw memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 1 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[05:33:52] <shdubsh>	 !log drop input-kafka-rsyslog-shipper in codfw
[05:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:03] <icinga-wm>	 PROBLEM - MediaWiki codfw memcached error rate on icinga1001 is CRITICAL: 2.227e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[05:42:14] <shdubsh>	 !log re-enable input-kafka-rsyslog-shipper in codfw
[05:42:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:29] <icinga-wm>	 RECOVERY - MediaWiki codfw memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 0 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[06:07:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:08:29] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[06:09:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:53:24] <icinga-wm>	 RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:54:27] <elukey>	 just cleaned up an old unit --^
[08:54:55] <elukey>	 about cassandra-daily-coord-local_group_default_T_mediarequest_per_file I guess that all those alerts are false positives right?
[08:55:04] <elukey>	 err sorry wrong chan :)
[09:26:57] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed 
[09:26:57] <icinga-wm>	 onse was received https://wikitech.wikimedia.org/wiki/RESTBase
[09:28:27] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[10:12:17] <wikibugs>	 (PS1) ArielGlenn: set start date for partial dumps back to normal [puppet] - https://gerrit.wikimedia.org/r/538370 (https://phabricator.wikimedia.org/T233276)
[10:13:06] <wikibugs>	 (CR) ArielGlenn: [C: +2] set start date for partial dumps back to normal [puppet] - https://gerrit.wikimedia.org/r/538370 (https://phabricator.wikimedia.org/T233276) (owner: ArielGlenn)
[10:24:07] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[10:28:53] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[10:36:47] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[10:39:59] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[10:51:11] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[10:52:47] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:10:13] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:11:55] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[11:19:49] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:27:39] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:30:55] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[12:50:29] <wikibugs>	 (PS1) KartikMistry: apertium-dan-nor: New upstream release [debs/contenttranslation/apertium-dan-nor] - https://gerrit.wikimedia.org/r/538379 (https://phabricator.wikimedia.org/T218184)
[12:52:58] <wikibugs>	 (CR) jerkins-bot: [V: -1] apertium-dan-nor: New upstream release [debs/contenttranslation/apertium-dan-nor] - https://gerrit.wikimedia.org/r/538379 (https://phabricator.wikimedia.org/T218184) (owner: KartikMistry)
[13:16:07] <wikibugs>	 (PS1) KartikMistry: apertium-nno-nob: New upstream release [debs/contenttranslation/apertium-nno-nob] - https://gerrit.wikimedia.org/r/538380 (https://phabricator.wikimedia.org/T218184)
[13:16:35] <wikibugs>	 (PS2) KartikMistry: apertium-dan-nor: New upstream release [debs/contenttranslation/apertium-dan-nor] - https://gerrit.wikimedia.org/r/538379 (https://phabricator.wikimedia.org/T218184)
[13:44:43] <wikibugs>	 Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to Ops Group for papaul@ - https://phabricator.wikimedia.org/T233189 (faidon) This is approved.  I'd like to see a separate task where (all of) #DC-Ops document all the commands they use `sudo` or elevated privileges in general (e.g...
[14:22:24] <wikibugs>	 (PS1) KartikMistry: apertium-swe-dan: New upstream release [debs/contenttranslation/apertium-swe-dan] - https://gerrit.wikimedia.org/r/538381 (https://phabricator.wikimedia.org/T218184)
[14:27:28] <wikibugs>	 (CR) Marostegui: [C: +1] "We need to review/add grants for this new host on the DB" [puppet] - https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) (owner: Dzahn)
[14:46:56] <wikibugs>	 (PS1) KartikMistry: apertium-swe-nor: New upstream release [debs/contenttranslation/apertium-swe-nor] - https://gerrit.wikimedia.org/r/538382 (https://phabricator.wikimedia.org/T218184)
[14:52:00] <wikibugs>	 (CR) jerkins-bot: [V: -1] apertium-swe-nor: New upstream release [debs/contenttranslation/apertium-swe-nor] - https://gerrit.wikimedia.org/r/538382 (https://phabricator.wikimedia.org/T218184) (owner: KartikMistry)
[15:31:52] <wikibugs>	 (PS2) KartikMistry: apertium-swe-nor: New upstream release [debs/contenttranslation/apertium-swe-nor] - https://gerrit.wikimedia.org/r/538382 (https://phabricator.wikimedia.org/T218184)
[17:26:25] <wikibugs>	 Operations, Acme-chief, Traffic: Benefit from acme-chief features in acme-chief clients - https://phabricator.wikimedia.org/T220359 (Krenair)
[18:11:49] <icinga-wm>	 PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[18:13:23] <icinga-wm>	 RECOVERY - MediaWiki codfw exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[19:01:45] <wikibugs>	 (Restored) Zoranzoki21: Enable Extension:Newsletter on hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) (owner: Zoranzoki21)
[19:01:55] <wikibugs>	 (PS25) Zoranzoki21: Enable Extension:Newsletter on hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151)
[19:02:14] <wikibugs>	 (PS26) Zoranzoki21: Enable Extension:Newsletter on hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151)
[23:59:00] <wikibugs>	 Puppet, Cloud-Services, Phabricator, cloud-services-team (Kanban): puppet function ipresolve unable to look up instance on labs-puppetmaster - https://phabricator.wikimedia.org/T139011 (Krenair) Open→Invalid Presumed no longer relevant following {T171188}