[00:00:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:02:54] (CR) Reedy: [C: +1] Use MediaWikiServices::getAuthManager [mediawiki-config] - https://gerrit.wikimedia.org/r/585910 (owner: Umherirrender) [00:04:54] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is OK: HTTP OK: HTTP/1.0 200 OK - 22359 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [00:07:22] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is OK: HTTP OK: HTTP/1.0 200 OK - 22372 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [00:29:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:31:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:38:32] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:42:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:56:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:58:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:23:32] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [01:27:42] PROBLEM - Host restbase1025 is DOWN: PING CRITICAL - Packet loss = 100% [01:30:14] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:30:30] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:30:30] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:30:30] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:30:30] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:30:52] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:31:14] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:31:24] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [01:31:24] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:31:24] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:31:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_cluster_eqiad site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:31:56] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:32:14] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:32:14] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:32:40] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:32:58] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:33:06] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [01:33:10] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:33:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:34:02] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:34:06] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:34:56] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:36:08] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is OK: HTTP OK: HTTP/1.0 200 OK - 22371 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:34:56] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [03:36:48] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [03:36:54] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [03:40:28] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [03:41:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:45:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:54:34] PROBLEM - snapshot of s5 in eqiad on db1115 is CRITICAL: snapshot for s5 at eqiad taken more than 3 days ago: Most recent backup 2020-04-09 03:41:37 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [04:12:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:14:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:17:18] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [04:17:20] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [04:24:40] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [04:25:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:26:28] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [04:27:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:31:50] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:34:02] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:44:38] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is OK: HTTP OK: HTTP/1.0 200 OK - 22377 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:44:58] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is OK: HTTP OK: HTTP/1.0 200 OK - 22373 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:55:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:57:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is 
OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:06:29] Just got Request from - via cp1075.eqiad.wmnet, ATS/8.0.6 [05:06:30] Error: 502, Next Hop Connection Failed at 2020-04-12 05:05:15 GMT on one page and "upstream connect error or disconnect/reset before headers. reset reason: overflow" on another [05:06:34] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [05:06:40] * AntiComposite takes this as a sign to go to sleep [05:06:45] AntiComposite: Me as well [05:06:56] PROBLEM - PHP7 rendering on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:06:56] PROBLEM - PHP7 rendering on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:06:56] PROBLEM - Nginx local proxy to apache on mw1391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:06:56] PROBLEM - Apache HTTP on mw1266 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:06:56] PROBLEM - Nginx local proxy to apache on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:06:57] PROBLEM - Nginx local proxy to apache on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [05:06:58] 504 Gateway Time-out on enwiki [05:06:58] PROBLEM - Nginx local proxy to apache on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:06:58] PROBLEM - Nginx local proxy to apache on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:06:58] PROBLEM - Nginx local proxy to apache on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:06:59] PROBLEM - Nginx local proxy to apache on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:06:59] PROBLEM - Nginx local proxy to apache on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:00] PROBLEM - Nginx local proxy to apache on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:00] PROBLEM - Nginx local proxy to apache on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:01] PROBLEM - Nginx local proxy to apache on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:01] PROBLEM - PHP7 rendering on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:08] PROBLEM - PHP7 rendering on mw1391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:08] PROBLEM - PHP7 rendering on mw1397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:08] PROBLEM - Apache HTTP on mw1409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:08] PROBLEM - Apache HTTP on mw1389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:08] PROBLEM - Apache HTTP on mw1411 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:09] PROBLEM - Apache HTTP on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:10] PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:12] PROBLEM - PHP7 rendering on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:12] PROBLEM - PHP7 rendering on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:12] PROBLEM - PHP7 rendering on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:12] PROBLEM - PHP7 rendering on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:12] PROBLEM - PHP7 rendering on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:14] PROBLEM - PHP7 rendering on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:14] PROBLEM - PHP7 rendering on mw1368 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:14] PROBLEM - PHP7 rendering on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:14] PROBLEM - PHP7 rendering on mw1264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:16] PROBLEM - Apache HTTP on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:16] PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:16] PROBLEM - Apache HTTP on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:16] PROBLEM - Apache HTTP on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:17] PROBLEM - Apache HTTP on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:17] PROBLEM - Apache HTTP on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:18] PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:18] PROBLEM - Apache HTTP on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:19] PROBLEM - Apache HTTP on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:19] PROBLEM - Nginx local proxy to apache on mw1407 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:20] PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:20] PROBLEM - Apache HTTP on mw1264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:21] PROBLEM - Nginx local proxy to apache on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:21] PROBLEM - Nginx local proxy to apache on mw1409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:22] PROBLEM - Nginx local proxy to apache on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:22] PROBLEM - Nginx local proxy to apache on mw1397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:23] PROBLEM - Nginx local proxy to apache on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:23] PROBLEM - Nginx local proxy to apache on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:24] PROBLEM - Nginx local proxy to apache on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:24] PROBLEM - Nginx local proxy to apache on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:25] PROBLEM - PHP7 rendering on mw1389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:25] PROBLEM - Apache HTTP on mw1413 is CRITICAL: CRITICAL 
- Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:26] PROBLEM - PHP7 rendering on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:26] PROBLEM - PHP7 rendering on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:27] PROBLEM - PHP7 rendering on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:27] PROBLEM - PHP7 rendering on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:28] PROBLEM - PHP7 rendering on mw1409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:28] PROBLEM - PHP7 rendering on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:29] PROBLEM - PHP7 rendering on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:29] PROBLEM - PHP7 rendering on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:30] PROBLEM - PHP7 rendering on mw1265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:30] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 8188 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops 
[05:07:31] PROBLEM - PHP7 rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:31] PROBLEM - PHP7 rendering on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:32] PROBLEM - PHP7 rendering on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:32] PROBLEM - PHP7 rendering on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:36] PROBLEM - Apache HTTP on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:36] PROBLEM - Nginx local proxy to apache on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:36] PROBLEM - Apache HTTP on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:36] PROBLEM - PHP7 rendering on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:38] PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:40] PROBLEM - Nginx local proxy to apache on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:40] PROBLEM - Nginx local proxy to apache on mw1266 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:40] PROBLEM - PHP7 rendering on mw1405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:42] PROBLEM - PHP7 rendering on mw1399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:42] PROBLEM - PHP7 rendering on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:42] PROBLEM - PHP7 rendering on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:42] PROBLEM - PHP7 rendering on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:44] PROBLEM - PHP7 rendering on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:44] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [05:07:44] PROBLEM - PHP7 rendering on mw1266 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:44] PROBLEM - Apache HTTP on mw1407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:46] PROBLEM - PHP7 rendering on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:46] PROBLEM - Apache HTTP on 
mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:46] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:46] PROBLEM - Apache HTTP on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:46] PROBLEM - Apache HTTP on mw1399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:48] PROBLEM - Apache HTTP on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:48] PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:48] PROBLEM - Apache HTTP on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:48] PROBLEM - Apache HTTP on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:49] PROBLEM - Apache HTTP on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:49] PROBLEM - Apache HTTP on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:50] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1265.eqiad.wmnet, mw1331.eqiad.wmnet, mw1395.eqiad.wmnet, mw1365.eqiad.wmnet, mw1367.eqiad.wmnet, mw1267.eqiad.wmnet, mw1330.eqiad.wmnet, mw1366.eqiad.wmnet, mw1322.eqiad.wmnet, mw1333.eqiad.wmnet, mw1323.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1350.eqiad.wmnet, mw1327.eqiad.wmnet, mw1261.eqiad. 
[05:07:50] ad.wmnet, mw1364.eqiad.wmnet, mw1407.eqiad.wmnet, mw1405.eqiad.wmnet, mw1351.eqiad.wmnet, mw1263.eqiad.wmnet, mw1320.eqiad.wmnet, mw1329.eqiad.wmnet, mw1269.eqiad.wmnet, mw1352.eqiad.wmnet, mw1264.eqiad.wmnet, mw1399.eqiad.wmnet, mw1355.eqiad.wmnet, mw1326.eqiad.wmnet, mw1268.eqiad.wmnet, mw1371.eqiad.wmnet, mw1319.eqiad.wmnet, mw1393.eqiad.wmnet, mw1373.eqiad.wmnet, mw1324.eqiad.wmnet, mw1353.eqiad.wmnet, mw1372.eqiad.wmnet, mw1 [05:07:51] mw1370.eqiad.wmnet, mw1403.eqiad.wmnet, mw1389.eqiad.wmnet, mw1274.eqiad.wmnet, mw1266.eqiad.wmnet, mw1271.eqiad.wmnet, mw1387.eqiad.wmnet, mw1321.eqiad.wmnet, mw1401.eqiad.wmnet, mw139 https://wikitech.wikimedia.org/wiki/PyBal [05:07:52] PROBLEM - PHP7 rendering on mw1411 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:52] PROBLEM - PHP7 rendering on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:54] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:07:54] PROBLEM - PHP7 rendering on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:55] PROBLEM - PHP7 rendering on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:55] PROBLEM - PHP7 rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:56] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:07:56] 
PROBLEM - PHP7 rendering on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:57] PROBLEM - PHP7 rendering on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:57] PROBLEM - PHP7 rendering on mw1395 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:57] PROBLEM - PHP7 rendering on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:07:58] PROBLEM - Apache HTTP on mw1391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:58] PROBLEM - Apache HTTP on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:58] PROBLEM - Apache HTTP on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:58] PROBLEM - Nginx local proxy to apache on mw1395 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:07:59] PROBLEM - Apache HTTP on mw1263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:00] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1265.eqiad.wmnet, mw1371.eqiad.wmnet, mw1365.eqiad.wmnet, mw1367.eqiad.wmnet, mw1267.eqiad.wmnet, mw1330.eqiad.wmnet, mw1322.eqiad.wmnet, mw1355.eqiad.wmnet, mw1323.eqiad.wmnet, mw1384.eqiad.wmnet, mw1327.eqiad.wmnet, mw1351.eqiad.wmnet, mw1413.eqiad.wmnet, mw1364.eqiad.wmnet, mw1354.eqiad.wmnet, mw1272.eqiad.wmnet, mw1263 [05:08:00] 274.eqiad.wmnet, 
mw1405.eqiad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1352.eqiad.wmnet, mw1264.eqiad.wmnet, mw1399.eqiad.wmnet, mw1266.eqiad.wmnet, mw1391.eqiad.wmnet, mw1321.eqiad.wmnet, mw1328.eqiad.wmnet, mw1333.eqiad.wmnet, mw1393.eqiad.wmnet, mw1366.eqiad.wmnet, mw1349.eqiad.wmnet, mw1269.eqiad.wmnet, mw1372.eqiad.wmnet, mw1350.eqiad.wmnet, mw1370.eqiad.wmnet, mw1397.eqiad.wmnet, mw1389.eqiad.wmnet, mw1331.eqiad.wmn [05:08:00] wmnet, mw1271.eqiad.wmnet, mw1387.eqiad.wmnet, mw1268.eqiad.wmnet, mw1395.eqiad.wmnet, mw1 https://wikitech.wikimedia.org/wiki/PyBal [05:08:01] PROBLEM - Nginx local proxy to apache on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:01] PROBLEM - Nginx local proxy to apache on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:02] PROBLEM - LVS HTTPS IPv4 #page on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:08:02] PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:03] PROBLEM - Apache HTTP on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:04] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw1265.eqiad.wmnet, mw1371.eqiad.wmnet, mw1395.eqiad.wmnet, mw1365.eqiad.wmnet, mw1367.eqiad.wmnet, mw1267.eqiad.wmnet, mw1322.eqiad.wmnet, mw1270.eqiad.wmnet, mw1331.eqiad.wmnet, mw1355.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1272.eqiad.wmnet, mw1387.eqiad.wmnet, mw1364.eqiad.wmnet, mw1407.eqiad.wmnet, mw1 [05:08:04] mw1263.eqiad.wmnet, mw1405.eqiad.wmnet, mw1329.eqiad.wmnet, mw1269.eqiad.wmnet, mw1271.eqiad.wmnet, mw1264.eqiad.wmnet, mw1266.eqiad.wmnet, 
mw1391.eqiad.wmnet, mw1321.eqiad.wmnet, mw1333.eqiad.wmnet, mw1393.eqiad.wmnet, mw1366.eqiad.wmnet, mw1324.eqiad.wmnet, mw1350.eqiad.wmnet, mw1389.eqiad.wmnet, mw1320.eqiad.wmnet, mw1319.eqiad.wmnet, mw1352.eqiad.wmnet, mw1268.eqiad.wmnet, mw1401.eqiad.wmnet, mw1403.eqiad.wmnet, mw1325.eqiad. [05:08:04] ad.wmnet, mw1409.eqiad.wmnet, mw1385.eqiad.wmnet, mw1369.eqiad.wmnet, mw1413.eqiad.wmnet, mw1353.eqiad.wmnet, mw1273.eqiad.wmnet, mw1262.eqiad.wmnet, mw1411.eqiad.wmnet, mw1330.eqiad.wm https://wikitech.wikimedia.org/wiki/PyBal [05:08:05] PROBLEM - Apache HTTP on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:05] PROBLEM - Apache HTTP on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:06] PROBLEM - Apache HTTP on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:06] PROBLEM - Nginx local proxy to apache on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:07] PROBLEM - PHP7 rendering on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:10] PROBLEM - Apache HTTP on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:10] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:10] PROBLEM - Apache HTTP on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:10] PROBLEM - Nginx local proxy to apache on mw1399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:10] 
PROBLEM - Nginx local proxy to apache on mw1411 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:11] PROBLEM - Nginx local proxy to apache on mw1405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:11] PROBLEM - Nginx local proxy to apache on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:12] PROBLEM - Nginx local proxy to apache on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={jmx_wdqs_updater,swagger_check_citoid_cluster_eqiad,swagger_check_mobileapps_cluster_eqiad,swagger_check_restbase_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:08:13] PROBLEM - Nginx local proxy to apache on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:14] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [05:08:14] PROBLEM - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:08:15] PROBLEM - Apache HTTP on mw1397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers 
[05:08:15] PROBLEM - Apache HTTP on mw1395 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:15] PROBLEM - Apache HTTP on mw1385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:16] PROBLEM - Apache HTTP on mw1265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:16] PROBLEM - Apache HTTP on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:17] PROBLEM - Nginx local proxy to apache on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:17] PROBLEM - Nginx local proxy to apache on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:18] PROBLEM - Nginx local proxy to apache on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:18] PROBLEM - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:08:19] PROBLEM - Nginx local proxy to apache on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:19] PROBLEM - Nginx local proxy to apache on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:20] PROBLEM - Nginx local proxy to apache on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:22] oww... 
[05:08:22] PROBLEM - Apache HTTP on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:22] PROBLEM - Apache HTTP on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:22] PROBLEM - Apache HTTP on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:22] PROBLEM - Apache HTTP on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:22] PROBLEM - Nginx local proxy to apache on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:23] PROBLEM - Nginx local proxy to apache on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:23] PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:24] PROBLEM - Nginx local proxy to apache on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:24] PROBLEM - Apache HTTP on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:25] PROBLEM - Apache HTTP on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:25] PROBLEM - Nginx local proxy to apache on mw1389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:26] PROBLEM - Apache HTTP on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:26] PROBLEM - Nginx local proxy to 
apache on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:27] PROBLEM - Nginx local proxy to apache on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:27] PROBLEM - PHP7 rendering on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:08:28] PROBLEM - PHP7 rendering on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:29] PROBLEM - Nginx local proxy to apache on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:29] PROBLEM - Nginx local proxy to apache on mw1385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:30] PROBLEM - Nginx local proxy to apache on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:30] PROBLEM - Nginx local proxy to apache on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:31] PROBLEM - Nginx local proxy to apache on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:31] PROBLEM - Nginx local proxy to apache on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [05:08:32] PROBLEM - PHP7 rendering on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:32] PROBLEM - PHP7 rendering on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:33] PROBLEM - PHP7 rendering on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:33] PROBLEM - Nginx local proxy to apache on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:34] PROBLEM - PHP7 rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:34] PROBLEM - PHP7 rendering on mw1407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:35] PROBLEM - PHP7 rendering on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:35] PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:36] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:08:38] PROBLEM - Apache HTTP on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:40] PROBLEM - PHP7 rendering on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:40] PROBLEM - PHP7 rendering on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:40] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:08:40] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:08:42] PROBLEM - LVS HTTPS IPv4 #page on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:08:42] PROBLEM - PHP7 rendering on mw1263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:42] PROBLEM - PHP7 rendering on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:42] PROBLEM - Apache HTTP on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:43] PROBLEM - Apache HTTP on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:43] PROBLEM - Apache HTTP on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:43] PROBLEM - Nginx local proxy to apache on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:44] PROBLEM - PHP7 rendering on mw1331 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:08:44] PROBLEM - Nginx local proxy to apache on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:45] PROBLEM - Nginx local proxy to apache on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:45] PROBLEM - Apache HTTP on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:46] PROBLEM - ATS TLS has reduced HTTP availability #page on icinga1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [05:08:46] PROBLEM - Nginx local proxy to apache on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:08:48] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:08:48] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:08:48] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:08:48] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:08:49] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:08:49] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1079 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:08:50] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:08:50] PROBLEM - PyBal backends health check on lvs1013 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1085.eqiad.wmnet, cp1089.eqiad.wmnet, cp1075.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1079.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1089.eqiad.wmnet, cp1075.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: testlb6_4 [05:08:51] 1.eqiad.wmnet, cp1079.eqiad.wmnet, cp1087.eqiad.wmnet, cp1089.eqiad.wmnet, cp1075.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1085.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:08:51] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:08:52] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:08:52] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:08:53] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:08:54] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [05:09:00] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:09:00] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:09:00] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:09:02] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [05:09:06] PROBLEM - LVS HTTPS IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:09:07] PROBLEM - wiki content on commons #page on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/1118/ [05:09:08] PROBLEM - proton 
endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [05:09:08] PROBLEM - Apache HTTP on mw1405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:09:14] PROBLEM - PHP7 rendering on mw1385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:09:14] moritzm: Ops Clinic Duty [05:09:14] PROBLEM - LVS HTTP IPv4 #page on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:09:15] PROBLEM - LVS HTTPS IPv4 #page on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:09:15] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [05:09:16] PROBLEM - Apache HTTP on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:09:16] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 
format using optimized for reading on mobile devices) is CRITICAL: Test Pr [05:09:16] from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [05:09:20] PROBLEM - Nginx local proxy to apache on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:09:20] PROBLEM - Nginx local proxy to apache on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:09:22] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:22] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:22] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:22] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:26] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1265.eqiad.wmnet, mw1371.eqiad.wmnet, mw1365.eqiad.wmnet, mw1367.eqiad.wmnet, mw1267.eqiad.wmnet, mw1322.eqiad.wmnet, mw1355.eqiad.wmnet, mw1323.eqiad.wmnet, mw1384.eqiad.wmnet, mw1327.eqiad.wmnet, mw1351.eqiad.wmnet, mw1413.eqiad.wmnet, mw1364.eqiad.wmnet, mw1354.eqiad.wmnet, mw1272.eqiad.wmnet, mw1263.eqiad.wmnet, mw1274 [05:09:26] 405.eqiad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1352.eqiad.wmnet, mw1264.eqiad.wmnet, mw1399.eqiad.wmnet, mw1266.eqiad.wmnet, mw1391.eqiad.wmnet, mw1321.eqiad.wmnet, mw1333.eqiad.wmnet, mw1393.eqiad.wmnet, mw1366.eqiad.wmnet, 
mw1349.eqiad.wmnet, mw1269.eqiad.wmnet, mw1350.eqiad.wmnet, mw1389.eqiad.wmnet, mw1331.eqiad.wmnet, mw1319.eqiad.wmnet, mw1271.eqiad.wmnet, mw1387.eqiad.wmnet, mw1268.eqiad.wmnet, mw1395.eqiad.wmn [05:09:26] wmnet, mw1403.eqiad.wmnet, mw1325.eqiad.wmnet, mw1407.eqiad.wmnet, mw1409.eqiad.wmnet, mw1 https://wikitech.wikimedia.org/wiki/PyBal [05:09:30] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [05:09:31] PROBLEM - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:09:31] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:31] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:32] PROBLEM - LVS HTTPS IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:09:40] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:40] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:46] PROBLEM - LVS HTTPS IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:09:46] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:50] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:09:50] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:50] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:50] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:09:56] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:09:58] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [05:10:00] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:10:00] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:10:00] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:10:06] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:10:08] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:08] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:08] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:15] PROBLEM - LVS HTTP IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:10:24] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:10:26] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 10.16 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:10:34] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:10:40] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator [05:10:50] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:52] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:52] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:54] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:54] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:54] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:56] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:56] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:56] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1089 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:56] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:56] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:58] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:58] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:10:58] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:00] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:00] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:12] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:16] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:16] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:16] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:20] PROBLEM - mobileapps 
endpoints health on scb1003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [05:11:26] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:26] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:26] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:26] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:26] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:27] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:27] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:28] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:28] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:29] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:29] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1081 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:32] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:40] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:42] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:42] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:42] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:52] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:11:54] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:11:58] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:12:01] <_joe_> wat [05:12:02] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:12:02] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:12:04] * volans|off here [05:12:19] was just about to ask if anybody's around [05:12:36] same 
[05:12:37] seeing reports of outage, enwiki not loading here. [05:12:38] * shdubsh waves [05:12:40] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:12:43] 504 gateway time-out [05:12:48] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [05:12:57] Yeah, I'm having issues, too.. West Coast US [05:13:08] "upstream connect error or disconnect/reset before headers. reset reason: overflow" [05:13:10] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [05:13:11] Ptwiki is off from Brazil [05:13:14] PROBLEM - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is CRITICAL: CRITICAL - exception while fetching the URL. 502 Server Error: Next Hop Connection Failed for url: https://en.planet.wikimedia.org/ https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [05:13:24] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 15 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:13:45] Commons too, 502'ing [05:13:46] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [05:13:52] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:13:59] It was lagging earlier. 
[05:14:06] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:14:11] eqsin is up, but laggy [05:14:18] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:14:52] https://873gear.com/irc/uploads/7b27e682b7bd2cab/IMG_20200412_021416_158.jpg here's the returning error [05:15:08] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [05:15:26] I'm mostly getting the WMF ATS 502s, but also the plain-text upstream connect error or disconnect/reset before headers. reset reason: overflow [05:15:54] PROBLEM - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/71/ [05:16:36] Yeah, just got a 502 [05:16:42] RECOVERY - wiki content on commons #page on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 172039 bytes in 7.800 second response time https://phabricator.wikimedia.org/project/view/1118/ [05:16:44] Grafana is also 502 for me, from boston area [05:17:06] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 9880 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:17:08] can confirm [05:17:28] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned 
the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:18:01] Phab and wikitech working for me in EU, main wikis all down for me [05:18:10] If grafana should work on mobile, it's completely blank for me in Brazil [05:18:14] logstash and icinga are up for me [05:18:19] JP, eqsin [05:18:24] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [05:18:28] No error, just emptiness [05:18:40] Following https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue, CURLs to everything but eqiad work as expected, eqiad 502s [05:18:52] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [05:18:56] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 6 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:19:22] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:21:16] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:22:00] Hello. Does anyone know why ru.wikipedia.org returns 504 Gateway Time-out? 
[05:22:20] PROBLEM - wiki content on commons #page on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/1118/ [05:22:22] Vort: known issue, ops are working on it [05:22:26] Thanks [05:22:30] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={2,3} site=eqiad topic=udp_localhost-err https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=loggi [05:22:30] c=All&var-consumer_group=All [05:23:24] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:23:28] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:23:54] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:24:30] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 3e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:25:02] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:25:30] PROBLEM - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is CRITICAL: CRITICAL - exception while fetching the URL. 502 Server Error: Next Hop Connection Failed for url: https://en.planet.wikimedia.org/ https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [05:26:54] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:27:16] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:28:12] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 4 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:28:34] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [05:28:36] (03PS1) 10BBlack: Strip certain parameters [puppet] - 10https://gerrit.wikimedia.org/r/588133 [05:29:04] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:29:13] (03CR) 10BBlack: [V: 03+2 C: 03+2] Strip certain parameters [puppet] - 10https://gerrit.wikimedia.org/r/588133 (owner: 10BBlack) [05:30:30] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:31:02] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:31:02] PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [05:31:16] !log pushing https://gerrit.wikimedia.org/r/588133 to cache_text [05:31:54] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 7146 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:32:26] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:32:26] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:32:56] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:33:04] bblack: Failed to log message to wiki. Somebody should check the error logs. 
[05:33:16] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:33:18] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:33:20] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:33:20] RECOVERY - wiki content on commons #page on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 172039 bytes in 0.014 second response time https://phabricator.wikimedia.org/project/view/1118/ [05:33:22] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:33:22] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:33:40] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [05:33:48] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 5 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:33:50] RECOVERY - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is OK: OK - Website content is current (602 = 86400) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [05:33:50] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:33:50] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.519 second response time 
https://wikitech.wikimedia.org/wiki/Varnish [05:33:52] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 3.532 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:33:52] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 3.659 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:33:52] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:33:52] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:33:52] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:33:53] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:33:53] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:33:54] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 7.203 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:33:56] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.323 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:33:56] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.461 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:33:58] RECOVERY - Varnish HTTP text-frontend - port 3124 on 
cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:33:58] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:00] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 3.667 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:08] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:34:10] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:12] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:12] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:12] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:12] \o/ [05:34:16] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [05:34:18] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:34:18] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:34:18] RECOVERY - restbase endpoints health on restbase1020 is OK: All 
endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:34:20] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:34:22] Hooray [05:34:22] RECOVERY - PHP7 rendering on mw1395 is OK: HTTP OK: HTTP/1.1 200 OK - 75850 bytes in 8.620 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:34:22] RECOVERY - Nginx local proxy to apache on mw1395 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.528 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:24] RECOVERY - Apache HTTP on mw1391 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.873 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:24] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:24] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:24] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:24] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:25] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:25] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish 
[05:34:26] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:26] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:27] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:27] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:28] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:28] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [05:34:29] RECOVERY - Apache HTTP on mw1368 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.401 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:29] RECOVERY - Nginx local proxy to apache on mw1405 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.796 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:30] RECOVERY - LVS HTTPS IPv4 #page on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15237 bytes in 6.545 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:34:30] RECOVERY - Nginx local proxy to apache on mw1411 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 2.294 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:31] RECOVERY - Nginx local proxy to apache on mw1354 is OK: HTTP OK: HTTP/1.1 
301 Moved Permanently - 629 bytes in 7.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:31] RECOVERY - Apache HTTP on mw1351 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.368 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:32] RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.529 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:32] RECOVERY - Nginx local proxy to apache on mw1399 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:33] RECOVERY - Nginx local proxy to apache on mw1366 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.571 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:33] RECOVERY - Nginx local proxy to apache on mw1272 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:34] RECOVERY - LVS HTTP IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:34:34] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.757 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:35] RECOVERY - Apache HTTP on mw1349 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.846 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:35] RECOVERY - Apache HTTP on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.612 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:36] RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 
200 OK - 15236 bytes in 1.506 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:34:36] RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.636 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:37] RECOVERY - Apache HTTP on mw1395 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:37] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:38] RECOVERY - Nginx local proxy to apache on mw1364 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 4.211 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:38] RECOVERY - Apache HTTP on mw1385 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:39] RECOVERY - Apache HTTP on mw1397 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:39] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:34:40] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:34:40] RECOVERY - Nginx local proxy to apache on mw1333 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.646 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:41] RECOVERY - Apache HTTP on mw1331 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.955 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [05:34:41] RECOVERY - Nginx local proxy to apache on mw1268 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:42] RECOVERY - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15249 bytes in 0.309 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:34:42] RECOVERY - Nginx local proxy to apache on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 4.743 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:43] RECOVERY - Nginx local proxy to apache on mw1332 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.087 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:43] RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.156 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:44] RECOVERY - Nginx local proxy to apache on mw1324 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.191 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:44] RECOVERY - Nginx local proxy to apache on mw1319 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.832 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:45] RECOVERY - Nginx local proxy to apache on mw1327 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.908 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:45] RECOVERY - Apache HTTP on mw1373 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:46] RECOVERY - Apache HTTP on mw1371 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.036 second 
response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:46] RECOVERY - Apache HTTP on mw1350 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:34:58] R.I.P icinga [05:35:11] welcome back [05:35:25] Thank you ops staff. [05:35:36] big round of applause [05:35:38] Yes, thank you ops staff! :) [05:35:39] Hi everybody, the SRE team is working on the issue, thanks for reporting [05:35:43] Thanks folks :D [05:35:46] * TheSandDoctor applauds [05:35:48] thanks folks [05:35:50] ttyl [05:35:59] what is problem? disk full? [05:36:03] <_joe_> can you confirm everything is working for you now? [05:36:06] up here [05:36:16] It's working now from russia [05:36:28] Working [05:36:29] arwiki working (Y) [05:36:35] I doubt it's disk space [05:36:37] Up here [05:36:39] dewiki working [05:36:45] can confirm enwiki and commons in BC @_joe_ [05:37:02] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:37:02] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:37:02] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:37:03] server: mw1263.eqiad.wmnet x-cache: cp5010 miss, cp5008 pass x-cache-status: pass [05:37:04] RECOVERY - ATS TLS has reduced HTTP availability #page on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [05:37:09] de and meta too. 
thanks again folks [05:37:14] ok at JP eqsin [05:37:24] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [05:37:38] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [05:37:44] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:38:00] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [05:38:25] enwiki up for me [05:38:38] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:38:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:38:40] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [05:38:44] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:38:55] 10Operations: Slow response times and 504 Gateway tomeouts accross all wiki projects - https://phabricator.wikimedia.org/T250025 (10Pruem) [05:39:02] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:39:06] RECOVERY - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1737 bytes in 0.098 second response time https://phabricator.wikimedia.org/project/view/71/ [05:40:22] 10Operations: Slow response times and 504 Gateway tomeouts accross all wiki projects - https://phabricator.wikimedia.org/T250025 (10RhinosF1) This was resolved a few moments ago, are you still getting the issues? [05:41:12] 10Operations: Slow response times and 504 Gateway tomeouts accross all wiki projects - https://phabricator.wikimedia.org/T250025 (10colewhite) p:05Triage→03Unbreak! a:03colewhite [05:42:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:42:22] _joe_: ^ see that task just opened [05:42:24] 10Operations: Slow response times and 504 Gateway tomeouts accross all wiki projects - https://phabricator.wikimedia.org/T250025 (10Pruem) As of now, it seems to have ceased. I'll keep checking. [05:42:59] 10Operations: Slow response times and 504 Gateway tomeouts accross all wiki projects - https://phabricator.wikimedia.org/T250025 (10RhinosF1) 05Open→03Resolved p:05Unbreak!→03Medium a:05colewhite→03None This was fixed. 
[05:43:07] 10Operations: Slow response times and 504 Gateway timeouts accross all wiki projects - https://phabricator.wikimedia.org/T250025 (10Pruem) 05Resolved→03Open p:05Medium→03Unbreak! a:03colewhite [05:43:19] 10Operations: Slow response times and 504 Gateway timeouts accross all wiki projects - https://phabricator.wikimedia.org/T250025 (10Koavf) Appears resolved. Can others help document at https://wikitech.wikimedia.org/wiki/Incident_documentation/20200412-eqiad_down [05:43:29] RhinosF1: It was claimed by a WMF staff member [05:43:42] RhinosF1: Cole is a WMF staff member [05:43:42] 10Operations: Slow response times and 504 Gateway timeouts accross all wiki projects - https://phabricator.wikimedia.org/T250025 (10Koavf) p:05Unbreak!→03Low [05:43:45] Don't close a task claimed by someone else [05:43:56] 10Operations: Slow response times and 504 Gateway timeouts accross all wiki projects - https://phabricator.wikimedia.org/T250025 (10Pruem) 05Open→03Resolved [05:44:12] I didn’t realise, there’s no on wiki account connected to see or information in the phab profile [05:44:50] There is the LDAP account, which is a redirect to his WMF account on SUL wikis [05:45:12] <_joe_> hey everyone, don't edit-war on phabricator :D [05:45:38] 10Operations: Slow response times and 504 Gateway timeouts accross all wiki projects - https://phabricator.wikimedia.org/T250025 (10Joe) >>! In T250025#6049704, @Koavf wrote: > Appears resolved. Can others help document at https://wikitech.wikimedia.org/wiki/Incident_documentation/20200412-eqiad_down Please don... [05:45:55] but phab haven't good mechanism for avoid edit conflicts.. 
[05:45:57] Too many hands in the pots :) [05:46:24] :P [05:46:27] indeed' [05:48:09] (03PS1) 10BBlack: Fix regex syntax in prev commit [puppet] - 10https://gerrit.wikimedia.org/r/588134 [05:48:24] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [05:49:30] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:50:15] !log restart ats-tls on cp[1077,1081,1083,1085].eqiad.wmnet- T249335 [05:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:22] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [05:50:26] (03PS2) 10BBlack: Fix regex syntax in prev commit [puppet] - 10https://gerrit.wikimedia.org/r/588134 [05:52:12] (03CR) 10BBlack: [V: 03+2 C: 03+2] Fix regex syntax in prev commit [puppet] - 10https://gerrit.wikimedia.org/r/588134 (owner: 10BBlack) [05:53:09] !log pushing https://gerrit.wikimedia.org/r/588134 to cache_text [05:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:08] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:56:26] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:57:52] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.001274 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [06:01:44] RECOVERY - Check if active EventStreams endpoint is delivering messages. 
on icinga1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [06:17:36] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [06:20:59] !log powercycle restbase1025 (not reachable, serial console shows blank, racadm getsel reports errors with DIMM_B2) [06:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:42] RECOVERY - Host restbase1025 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [06:26:14] 10Operations, 10ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (10elukey) [06:27:03] (03PS1) 10Ema: Rate limit non-API traffic from public clouds [puppet] - 10https://gerrit.wikimedia.org/r/588135 [06:27:34] 10Operations, 10ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (10elukey) Caught during boot: ` UEFI0106: One or more memory correctable training errors have occurred on memory slot: B2. Remove input power to the system, reseat the DIMM module and restart th... 
[06:27:40] PROBLEM - Host restbase1025 is DOWN: PING CRITICAL - Packet loss = 100% [06:27:50] this is me sorry --^ [06:27:52] RECOVERY - Host restbase1025 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [06:27:55] host up now [06:28:55] thanks elukey [06:29:21] volans: I can't ssh to it, I think we should depool it :( [06:30:05] ah snap frozen again, yes it is not usable [06:30:12] PROBLEM - Host restbase1025 is DOWN: PING CRITICAL - Packet loss = 100% [06:30:17] it just rebooted by itself [06:30:18] elukey: ack, kill it [06:30:30] if loops the reboot power it dow [06:30:31] *down [06:32:11] !log powerdown restbase1025 - T250027 [06:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:17] T250027: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 [06:32:27] elukey: I'd say let's also depool it officially [06:34:01] yep doing it [06:35:34] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1025.eqiad.wmnet [06:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:06] <3 elukey! [06:36:49] 10Operations, 10ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (10elukey) ` elukey@puppetmaster1001:~$ sudo confctl depool --hostname restbase1025.eqiad.wmnet eqiad/restbase/restbase/restbase1025.eqiad.wmnet: pooled changed yes => no eqiad/restbase/restbase-b... 
[06:37:05] ok all done [06:59:24] !log restarting blazegraph on wdqs1004 (T242453) [06:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:30] T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453 [07:21:51] <_joe_> elukey: i think something will be needed for cassandra too, but i'm inclined to leave it for later [07:58:20] 10Operations, 10Cloud-Services, 10Traffic, 10Wikimedia-Incident: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10Adithyak1997) I don't know whether its related. Yesterday, some of the users including me have faced problems logging into... [08:52:25] 10Operations, 10Cloud-Services, 10Traffic, 10Wikimedia-Incident: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10Lirazelf) Hi there, was pointed in this direction by the folks at wikidata:project chat - I'm also experiencing issues wit... [09:28:42] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Duplicate "moderator request(s) waiting" emails sent to list admins - https://phabricator.wikimedia.org/T250032 (10Aklapper) [09:57:28] 10Operations, 10ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (10elukey) @Eevans adding yourself to this task as FYI :) [10:18:11] !log restart wdqs-updater on wdqs1004 (logs show no reports from the past hours, last one were stack traces related to a json decode failure) [10:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:00] (03CR) 10Nikerabbit: "I did the former: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/587251 (which also updated the misleading comment)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/586353 (https://phabricator.wikimedia.org/T165128) (owner: 10Nikerabbit) [10:41:40] (03Abandoned) 10Nikerabbit: Restore Beta Cluster logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586353 (https://phabricator.wikimedia.org/T165128) (owner: 10Nikerabbit) [11:01:28] 10Operations, 10Wikimedia-Incident: Slow response times and 504 Gateway timeouts accross all wiki projects - https://phabricator.wikimedia.org/T250025 (10Peachey88) [11:11:34] !log restart ats-tls on cp5008.eqsin.wmnet - T249335 [11:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:41] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [11:14:42] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:44] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:38] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:54:46] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5012 is OK: HTTP OK: HTTP/1.0 200 OK - 22380 bytes in 2.933 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:32:15] (03PS6) 10Zoranzoki21: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) [14:36:06] 10Operations, 10Wikimedia-Mailing-lists: Password reset for wikiwomencamp-bounces mainling list - https://phabricator.wikimedia.org/T250035 (10AnnaTorres) [14:40:40] 10Operations, 
10Wikimedia-Mailing-lists: Password reset for wikiwomencamp-bounces mainling list - https://phabricator.wikimedia.org/T250035 (10AnnaTorres) [14:46:02] 10Operations, 10Wikimedia-Mailing-lists: Password reset for wikiwomencamp-bounces mainling list - https://phabricator.wikimedia.org/T250035 (10Reedy) a:05AnnaTorres→03None [15:03:01] 10Operations, 10Wikimedia-Mailing-lists: Password reset for admin of wikiwomencamp mailing list - https://phabricator.wikimedia.org/T250035 (10Aklapper) [15:04:03] 10Operations, 10Wikimedia-Mailing-lists: Password reset for admin of wikiwomencamp mailing list - https://phabricator.wikimedia.org/T250035 (10Aklapper) @AnnaTorres: Once you have access again, you probably want to remove kherold from the second field "The list administrator email addresses" on https://lists.w... [15:57:37] 10Operations, 10Keyholder, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services): Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10hashar) I forgot to check for open changes, thank you for the notification. I guess it is tim... 
[16:10:06] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Rebuild helm/helm-diff for buster-wikimedia - https://phabricator.wikimedia.org/T249812 (10hashar) [18:57:48] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:57:48] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:57:48] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:58:10] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:58:20] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:58:22] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [18:58:24] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:58:46] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:59:34] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:59:54] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:00:34] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:01:56] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics 
within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:03:08] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:03:38] (03PS1) 10Andrew Bogott: designate: remove second_region_* hiera values [puppet] - 10https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) [19:03:52] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:05:02] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:05:26] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:07:14] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:07:42] (03CR) 10jerkins-bot: [V: 04-1] designate: remove second_region_* hiera values [puppet] - 10https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [19:09:16] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:13:40] (03PS2) 10Andrew Bogott: designate: remove second_region_* hiera values [puppet] - 10https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) [19:17:36] (03CR) 10jerkins-bot: [V: 04-1] designate: remove second_region_* hiera values [puppet] - 10https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [19:17:58] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:18:02] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:18:24] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:21:04] (03PS3) 10Andrew Bogott: designate: remove second_region_* hiera values [puppet] - 10https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) [19:21:32] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:23:18] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:23:50] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:25:07] (03CR) 10jerkins-bot: [V: 04-1] designate: remove second_region_* hiera values [puppet] - 10https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [19:28:12] (03PS4) 10Andrew Bogott: designate: remove second_region_* hiera values [puppet] - 10https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) [19:29:16] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:31:10] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:32:52] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:34:50] RECOVERY - High average GET latency 
for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:38:28] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:41:48] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:41:50] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:41:50] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:42:06] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:42:24] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:43:32] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:43:54] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:44:10] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:45:28] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:45:30] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:45:42] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:45:44] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:50:58] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:51:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:52:46] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:52:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:58:34] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:02:02] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:03:12] (PS1) Andrew Bogott: (WIP) Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [20:03:48] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:04:04] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:05:54] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:07:06] (CR) jerkins-bot: [V: -1] (WIP) Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott) [20:09:38] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [20:09:42] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:10:08] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:10:22] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:11:16] PROBLEM - MediaWiki exceptions and fatals 
per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:11:18] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:11:20] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:11:28] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [20:11:52] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:12:08] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:13:06] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:13:18] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:13:26] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:13:46] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:14:54] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:14:56] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:15:30] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:18:46] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:18:54] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:22:30] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:24:18] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:27:10] Operations, serviceops, Patch-For-Review: VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki - https://phabricator.wikimedia.org/T249535 (Framawiki) >>! In T249535#6048762, @Nattes wrote: > Hi I dont know if this is the riight pace to ask. So... 
[20:30:28] Operations, Growth-Team, StructuredDiscussions: Flow failing with Error contacting the Parsoid/RESTBase server (HTTP 400) - https://phabricator.wikimedia.org/T249997 (Framawiki) [20:32:00] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:33:24] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [20:33:58] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:34:52] Operations, DNS, Traffic, Wikimedia-Apache-configuration, Wikimedia-Site-requests: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 (Framawiki) The task title is pretty clear now. It's about redirecting everything non-names... 
[20:35:04] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:35:10] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [20:35:24] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:35:40] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:35:46] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:36:19] Operations, DNS, Traffic, Wikimedia-Apache-configuration, Wikimedia-Site-requests: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 (RhinosF1) >>! In T249648#6050573, @Framawiki wrote: > The task title is pretty clear now.... 
[20:36:52] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:36:58] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:37:14] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:38:40] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:38:48] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:42:32] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:46:10] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [20:47:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:47:56] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:47:56] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:48:28] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:48:40] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:49:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:50:16] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:51:30] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:51:30] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:52:12] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:53:28] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:55:33] Operations, MediaWiki-Cache, Traffic, Performance-Team (Radar): Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835 (Krinkle) [20:57:06] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:58:02] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:58:56] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:59:46] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:59:48] Operations, MediaWiki-Cache, Traffic, Performance-Team (Radar): Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835 (Krinkle) [21:00:51] Operations, Core Platform Team, MediaWiki-Cache, Traffic, Performance-Team (Radar): Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835 (Krinkle) As part of my focus on stability/sustainability, I'd like to try taking this on as part of the Perf Team.... [21:01:02] Operations, Core Platform Team, MediaWiki-Cache, Traffic, Performance-Team (Radar): Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835 (Krinkle) a:Krinkle [21:01:57] (PS2) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [21:02:32] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:05:04] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response 
was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:06:16] (CR) jerkins-bot: [V: -1] Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott) [21:06:48] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:08:00] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:09:41] (PS3) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [21:11:40] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:17:10] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:26:05] (PS5) Andrew Bogott: designate: remove second_region_* hiera values [puppet] - https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) [21:26:07] (PS4) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [21:26:09] (PS1) Andrew Bogott: designate: change api_base_uri to proper HA endpoint [puppet] - https://gerrit.wikimedia.org/r/588175 [21:28:06] (CR) jerkins-bot: [V: -1] designate: change api_base_uri to proper HA endpoint [puppet] - https://gerrit.wikimedia.org/r/588175 (owner: Andrew Bogott) [21:30:05] (PS2) Andrew Bogott: designate: change api_base_uri to proper HA endpoint [puppet] - https://gerrit.wikimedia.org/r/588175 [21:30:07] (PS6) Andrew Bogott: designate: remove second_region_* hiera values [puppet] - https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) [21:30:09] (PS5) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [21:30:55] (CR) jerkins-bot: [V: -1] designate: change api_base_uri to proper HA endpoint [puppet] - https://gerrit.wikimedia.org/r/588175 (owner: Andrew Bogott) [21:31:53] Operations, Growth-Team, StructuredDiscussions, VisualEditor: Flow failing with Error contacting the Parsoid/RESTBase server (HTTP 400) - https://phabricator.wikimedia.org/T249997 (Framawiki) I've received an email via OTRS regarding this error on frwiki from an editor that was using #visualedito... 
[21:33:06] (PS7) Andrew Bogott: designate: remove second_region_* hiera values [puppet] - https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) [21:33:08] (PS6) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [21:33:10] (PS1) Andrew Bogott: designate: change api_base_uri to proper HA endpoint [puppet] - https://gerrit.wikimedia.org/r/588176 [21:35:58] (Abandoned) Andrew Bogott: designate: change api_base_uri to proper HA endpoint [puppet] - https://gerrit.wikimedia.org/r/588175 (owner: Andrew Bogott) [21:44:53] (PS2) Andrew Bogott: designate: change api_base_uri to proper HA endpoint [puppet] - https://gerrit.wikimedia.org/r/588176 [21:44:55] (PS8) Andrew Bogott: designate: remove second_region_* hiera values [puppet] - https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) [21:44:57] (PS7) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [21:45:28] Operations, DNS, Traffic, Wikimedia-Apache-configuration, Wikimedia-Site-requests: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 (Bugreporter) Note sco.wiktionary.org/wiki/ and sco.wiktionary.org should be redirected a v... 
[21:49:29] (CR) jerkins-bot: [V: -1] designate: change api_base_uri to proper HA endpoint [puppet] - https://gerrit.wikimedia.org/r/588176 (owner: Andrew Bogott) [21:51:04] (PS3) Andrew Bogott: designate: change api_base_uri to proper HA endpoint [puppet] - https://gerrit.wikimedia.org/r/588176 [21:51:06] (PS9) Andrew Bogott: designate: remove second_region_* hiera values [puppet] - https://gerrit.wikimedia.org/r/588163 (https://phabricator.wikimedia.org/T249941) [21:51:08] (PS8) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [21:55:11] (CR) Andrew Bogott: "pcc run: https://puppet-compiler.wmflabs.org/compiler1002/21856/cloudservices1003.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/588176 (owner: Andrew Bogott) [22:03:52] (PS9) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [22:37:10] (PS10) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [22:47:50] (PS11) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [22:50:27] Operations, MediaWiki-General, Traffic: Requests with utf-8 in the URL return a outdated page revision - https://phabricator.wikimedia.org/T23027 (Krinkle) [22:51:42] (PS12) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [22:53:37] (PS13) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [22:56:48] (PS14) Andrew 
Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [23:01:24] (CR) jerkins-bot: [V: -1] Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott) [23:06:33] (PS15) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [23:11:48] (PS16) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [23:14:23] Operations, Performance-Team, Traffic, Wikimedia-Incident: General GET/POST limiting in MediaWiki - https://phabricator.wikimedia.org/T115088 (Krinkle) Open→Resolved a:Krinkle >>! In T20489#6050810, @Krinkle wrote: > […] For the concern of general load and concurrency (not individual... 
[23:16:52] (PS17) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [23:22:28] (PS18) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [23:26:04] (PS19) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [23:32:29] (PS20) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [23:37:11] (PS21) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [23:47:09] (PS22) Andrew Bogott: Designate: use a list of designate hosts in hiera [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) [23:53:00] (CR) Andrew Bogott: "sample pcc run: https://puppet-compiler.wmflabs.org/compiler1003/21872/" [puppet] - https://gerrit.wikimedia.org/r/588169 (https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott)