[01:17:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:19:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:45:17] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [02:14:49] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [02:30:45] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [02:42:09] 10SRE, 10Wikimedia-General-or-Unknown: Periodically run purgeExpiredBlocks.php maintenance script - https://phabricator.wikimedia.org/T257473 (10DannyS712) 05Stalled→03Open Should have been unstalled a while ago [03:10:57] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [03:22:15] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:29:27] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:01:37] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [04:03:07] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 93490008 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:05:33] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 83784 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:16:12] 10SRE, 10Wikimedia-General-or-Unknown, 10Wikimedia-maintenance-script-run: Periodically run purgeExpiredBlocks.php maintenance script - https://phabricator.wikimedia.org/T257473 (10Peachey88) [04:26:23] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [04:39:03] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [04:59:25] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:41:45] RECOVERY - SSH on mw2227.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:58:51] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:59:39] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210314T0800) [09:20:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:22:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:49:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:52:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:49:03] (03PS1) 10Rohitesh20: Add logger to recover-dump::to indicate actions taken [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/671942 (https://phabricator.wikimedia.org/T277162) [10:49:33] PROBLEM - SSH on mw2227.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:31:41] (03CR) 10Zabe: [C: 03+1] Add deleterevision right to botadmin group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/671402 (https://phabricator.wikimedia.org/T277358) (owner: 10Huji) [11:40:05] (03PS3) 10Zabe: Enable DiscussionsTools for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669960 (https://phabricator.wikimedia.org/T276851) [11:50:55] RECOVERY - SSH on mw2227.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:59:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [12:01:47] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [12:04:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [12:06:47] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [12:56:53] PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:31:21] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:32:59] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:24:12] I am a little confused, I didn't receive a page [14:43:10] !log depool wdqs1005 and restart blazegraph - will keep depooled until this server has catched up on lag [14:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:17] RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:45:55] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.080 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:46:41] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 4.432e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [14:52:37] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [16:19:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:22:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:45:13] PROBLEM - Varnish frontend child restarted on cp5011 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp5011&var-datasource=eqsin+prometheus/ops [16:56:32] (03PS1) 10Zabe: Configure default search namespaces for thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672041 (https://phabricator.wikimedia.org/T275280) [17:00:41] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.9841 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:00:49] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:01:55] PROBLEM - Apache HTTP on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:05] PROBLEM - Apache HTTP on mw1410 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:05] PROBLEM - PHP7 rendering on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:09] PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:09] PROBLEM - PHP7 rendering on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:13] PROBLEM - Apache HTTP on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:13] PROBLEM - PHP7 rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:13] PROBLEM - Apache HTTP on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:13] PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:13] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:19] PROBLEM - Apache HTTP on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:21] PROBLEM - PHP7 rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:25] PROBLEM - PHP7 rendering on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:25] PROBLEM - PHP7 rendering on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:25] PROBLEM - PHP7 rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:25] PROBLEM - PHP7 rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:27] PROBLEM - PHP7 rendering on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:27] PROBLEM - PHP7 rendering on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:29] PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:29] PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:33] PROBLEM - PHP7 rendering on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:35] PROBLEM - Apache HTTP on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:35] PROBLEM - PHP7 rendering on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:37] PROBLEM - Apache HTTP on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:37] PROBLEM - Apache HTTP on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:41] PROBLEM - PHP7 rendering on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:41] PROBLEM - PHP7 rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:41] PROBLEM - Apache HTTP on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:45] uhhhhhhhhhh [17:02:47] PROBLEM - Apache HTTP on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:55] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:02:55] PROBLEM - PHP7 rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:55] PROBLEM - Apache HTTP on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:57] PROBLEM - PHP7 rendering on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:57] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:58] wat is happening [17:02:59] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:02:59] PROBLEM - PHP7 rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:02:59] PROBLEM - PHP7 rendering on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:01] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:01] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:04] wikis still load for me [17:03:05] PROBLEM - Apache HTTP on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:10] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.01299 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [17:03:15] PROBLEM - PHP7 rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:15] PROBLEM - PHP7 rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:17] PROBLEM - PHP7 rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:17] PROBLEM - Apache HTTP on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:17] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:17] PROBLEM - Apache HTTP on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:19] PROBLEM - PHP7 rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:20] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.2382 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [17:03:27] PROBLEM - PHP7 rendering on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:27] PROBLEM - PHP7 rendering on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:29] PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:29] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [17:03:29] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:33] PROBLEM - PHP7 rendering on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:33] PROBLEM - PHP7 rendering on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:34] Majavah: those are all API servers [17:03:35] PROBLEM - Apache HTTP on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:35] PROBLEM - Apache HTTP on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:38] ah [17:03:41] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most [17:03:41] January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wi [17:03:45] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [17:03:47] PROBLEM - Apache HTTP on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:47] PROBLEM - PHP7 rendering on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:47] PROBLEM - PHP7 rendering on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:47] PROBLEM - PHP7 rendering on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:47] PROBLEM - PHP7 rendering on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:47] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:49] PROBLEM - PHP7 rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:49] PROBLEM - Apache HTTP on mw1412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:49] PROBLEM - PHP7 rendering on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:49] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:50] PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:50] upstream connect error or disconnect/reset before headers. reset reason: overflow [17:03:50] PROBLEM - Apache HTTP on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:51] PROBLEM - Apache HTTP on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:51] api down? [17:03:54] <_joe_> yes [17:03:57] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond fil [17:03:57] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [17:03:57] PROBLEM - Apache HTTP on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:03:58] <_joe_> it seems [17:03:59] PROBLEM - PHP7 rendering on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:59] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:03:59] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:03:59] PROBLEM - PHP7 rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:03:59] PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:01] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:03] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:03] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expectin [17:04:03] est/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [17:04:05] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:07] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:04:07] <_joe_> give me 5 mins and I'll be at a computer [17:04:07] PROBLEM - Apache HTTP on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:09] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond fil [17:04:09] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [17:04:11] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:11] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:11] PROBLEM - PHP7 rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:04:13] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:15] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:17] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:17] PROBLEM - Apache HTTP on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:17] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:18] i don [17:04:19] PROBLEM - PHP7 rendering on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:04:19] PROBLEM - PHP7 rendering on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:04:19] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:21] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:21] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:21] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:27] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [17:04:29] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:30] i don't think I can, but if I can help in any way let me know [17:04:31] PROBLEM - PHP7 rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:04:31] PROBLEM - Apache HTTP on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:31] PROBLEM - Apache HTTP on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:31] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [17:04:33] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 682 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:04:35] PROBLEM - PHP7 rendering on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:04:35] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:35] PROBLEM - PHP7 rendering on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:04:37] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:37] PROBLEM - Apache HTTP on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:39] PROBLEM - Apache HTTP on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:39] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:41] PROBLEM - Apache HTTP on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:41] PROBLEM - Apache HTTP on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:43] PROBLEM - Apache HTTP on mw1408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:45] RECOVERY - Apache HTTP on mw1410 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.410 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:04:47] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:04:49] PROBLEM - PHP7 rendering on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:04:51] PROBLEM - PHP7 rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:04:51] PROBLEM - PHP7 rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:04:55] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:04:55] PROBLEM - Apache HTTP on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:04:59] PROBLEM - PHP7 rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:05:01] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.8788 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:05:01] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:05:01] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most [17:05:01] January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [17:05:02] PROBLEM - Apache HTTP on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:05:03] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:03] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1284.eqiad.wmnet, mw1362.eqiad.wmnet, mw1380.eqiad.wmnet, mw1390.eqiad.wmnet, mw1406.eqiad.wmnet, mw1297.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1287.eqiad.wmnet, mw1348.eqiad.wmnet, mw1288.eqiad.wmnet, mw1346.eqiad.wmnet, mw1314.eqiad.wmnet, mw1386.eqiad.wmnet, mw1410.eqiad.wmnet, mw1342.eqiad.wmnet, mw1396. [17:05:03] 04.eqiad.wmnet, mw1283.eqiad.wmnet, mw1381.eqiad.wmnet, mw1388.eqiad.wmnet, mw1340.eqiad.wmnet, mw1343.eqiad.wmnet, mw1281.eqiad.wmnet, mw1347.eqiad.wmnet, mw1345.eqiad.wmnet, mw1339.eqiad.wmnet, mw1286.eqiad.wmnet, mw1282.eqiad.wmnet, mw1276.eqiad.wmnet, mw1412.eqiad.wmnet, mw1279.eqiad.wmnet, mw1408.eqiad.wmnet, mw1376.eqiad.wmnet, mw1317.eqiad.wmnet, mw1278.eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1312.eqiad.wmne [17:05:03] mnet, mw1402.eqiad.wmnet, mw1383.eqiad.wmnet, mw1392.eqiad.wmnet, mw1400.eqiad.wmnet, mw1382.eqiad.wmnet, mw1289.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341.eqiad.wmnet, mw1285.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [17:05:03] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} ( [17:05:04] metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was rece [17:05:04] 1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:05:04] PROBLEM - PHP7 rendering on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:05:05] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:05:06] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:06] RECOVERY - PHP7 rendering on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.792 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:05:06] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:07] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:07] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:08] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:05:09] RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:05:10] PROBLEM - LVS api eqiad port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:05:10] PROBLEM - Apache HTTP on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:05:13] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:13] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:15] RECOVERY - PHP7 rendering on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.367 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:05:15] RECOVERY - PHP7 rendering on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 7.610 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:05:15] RECOVERY - Apache HTTP on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 7.413 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:05:23] RECOVERY - PHP7 rendering on mw1340 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 8.147 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:05:25] RECOVERY - PHP7 rendering on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.940 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:05:27] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:29] RECOVERY - Apache HTTP on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.157 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:05:35] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:41] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:43] PROBLEM - PHP7 rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:05:47] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:47] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:47] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:05:51] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:05:53] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:05:55] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [17:05:57] PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:06:01] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:06:01] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1284.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1390.eqiad.wmnet, mw1361.eqiad.wmnet, mw1297.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1287.eqiad.wmnet, mw1362.eqiad.wmnet, mw1288.eqiad.wmnet, mw1281.eqiad.wmnet, mw1314.eqiad.wmnet, mw1386.eqiad.wmnet, mw1348.eqiad.wmnet, mw1402.eqiad.wmnet, mw1404. [17:06:01] 83.eqiad.wmnet, mw1381.eqiad.wmnet, mw1388.eqiad.wmnet, mw1317.eqiad.wmnet, mw1340.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1286.eqiad.wmnet, mw1282.eqiad.wmnet, mw1276.eqiad.wmnet, mw1412.eqiad.wmnet, mw1279.eqiad.wmnet, mw1408.eqiad.wmnet, mw1410.eqiad.wmnet, mw1376.eqiad.wmnet, mw1363.eqiad.wmnet, mw1278.eqiad.wmnet, mw1290.eqiad.wmnet, mw1316.eqiad.wmnet, mw1400.eqiad.wmnet, mw1383.eqiad.wmne [17:06:01] mnet, mw1375.eqiad.wmnet, mw1342.eqiad.wmnet, mw1360.eqiad.wmnet, mw1382.eqiad.wmnet, mw1289.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341.eqiad.wmnet, mw1285.eqiad.wmnet, mw1277.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [17:06:05] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:06:05] PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:06:09] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} ( [17:06:09] metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} [17:06:09] test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:06:25] RECOVERY - Apache HTTP on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 9.211 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:06:27] RECOVERY - Apache HTTP on mw1412 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.659 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:06:29] RECOVERY - Apache HTTP on mw1339 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 9.681 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:06:37] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 9.428 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:06:43] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:06:43] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 9.750 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:06:47] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:06:51] PROBLEM - PHP7 rendering on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:07:13] RECOVERY - Apache HTTP on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 9.411 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:07:13] PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:07:19] RECOVERY - Apache HTTP on mw1408 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.708 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:07:21] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:07:21] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:07:25] RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 8.398 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:07:27] RECOVERY - PHP7 rendering on mw1343 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 8.231 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:07:29] RECOVERY - PHP7 rendering on mw1315 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 9.612 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:07:29] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 8.290 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:07:31] RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 9.420 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:07:39] RECOVERY - PHP7 rendering on mw1342 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 8.643 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:07:39] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.208 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:07:41] RECOVERY - PHP7 rendering on mw1341 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 7.580 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:07:41] RECOVERY - PHP7 rendering on mw1314 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 8.714 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:07:45] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 7.798 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:07:47] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:07:49] PROBLEM - test.wikidata.org dispatch lag is REALLY high ---4000s- on test.wikidata.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/71/ [17:07:57] RECOVERY - Apache HTTP on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 6.437 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:07] RECOVERY - PHP7 rendering on mw1276 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:13] RECOVERY - PHP7 rendering on mw1406 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 6.502 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:13] RECOVERY - PHP7 rendering on mw1346 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 5.345 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:15] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 5.568 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:15] RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 6.240 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:19] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:08:19] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:08:19] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:08:21] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 4.065 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:21] RECOVERY - Apache HTTP on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 9.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:23] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:08:23] RECOVERY - PHP7 rendering on mw1278 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:25] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.491 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:27] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:08:27] RECOVERY - Apache HTTP on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 3.172 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:27] RECOVERY - PHP7 rendering on mw1297 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 4.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:29] RECOVERY - PHP7 rendering on mw1289 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 4.588 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:29] RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 4.989 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:31] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 5.520 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:31] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:08:31] RECOVERY - PHP7 rendering on mw1281 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 5.673 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:35] RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 3.231 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:37] RECOVERY - PHP7 rendering on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 4.534 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:39] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 3.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:39] RECOVERY - PHP7 rendering on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 4.660 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:39] RECOVERY - PHP7 rendering on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.673 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:39] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 3.440 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:41] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:08:41] RECOVERY - PHP7 rendering on mw1284 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.672 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:43] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:08:43] RECOVERY - Apache HTTP on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.871 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:47] RECOVERY - Apache HTTP on mw1378 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 5.816 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:51] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1379.eqiad.wmnet, mw1378.eqiad.wmnet, mw1381.eqiad.wmnet, mw1375.eqiad.wmnet, mw1359.eqiad.wmnet, mw1356.eqiad.wmnet, mw1281.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [17:08:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:55] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [17:08:55] RECOVERY - PHP7 rendering on mw1381 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:55] RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.209 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:55] RECOVERY - PHP7 rendering on mw1279 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:57] RECOVERY - PHP7 rendering on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.945 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:57] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:57] RECOVERY - PHP7 rendering on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.336 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:08:57] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.420 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:57] RECOVERY - Apache HTTP on mw1342 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.019 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:57] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.162 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:08:59] RECOVERY - PHP7 rendering on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 3.150 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:09:01] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [17:09:05] RECOVERY - Apache HTTP on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:09:09] RECOVERY - PHP7 rendering on mw1286 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.464 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:09:09] RECOVERY - PHP7 rendering on mw1378 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 4.240 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:09:11] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [17:09:15] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:17] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:09:17] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:17] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 2.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:09:17] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:09:17] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:17] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:19] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [17:09:19] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:19] RECOVERY - PHP7 rendering on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:09:21] RECOVERY - PHP7 rendering on mw1282 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.409 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:09:21] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:09:21] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:23] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:23] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [17:09:25] RECOVERY - Apache HTTP on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.981 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:09:27] RECOVERY - PHP7 rendering on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:09:29] RECOVERY - PHP7 rendering on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.240 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:09:29] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:29] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:33] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:33] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:33] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:33] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:33] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:39] RECOVERY - PHP7 rendering on mw1347 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:09:39] RECOVERY - Apache HTTP on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:09:39] RECOVERY - Apache HTTP on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.603 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:09:43] RECOVERY - PHP7 rendering on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.609 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:09:43] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:09:43] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [17:09:43] RECOVERY - PHP7 rendering on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.192 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:09:45] RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.410 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:09:45] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:45] RECOVERY - Apache HTTP on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:09:45] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:47] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:49] RECOVERY - Apache HTTP on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:09:49] RECOVERY - Apache HTTP on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:09:49] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:09:49] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 145 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:09:49] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:51] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:09:55] RECOVERY - PHP7 rendering on mw1359 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.588 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:09:59] RECOVERY - PHP7 rendering on mw1287 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:10:01] RECOVERY - PHP7 rendering on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 2.784 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:10:01] RECOVERY - Apache HTTP on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.429 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:10:01] RECOVERY - PHP7 rendering on mw1288 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:10:03] RECOVERY - Apache HTTP on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:10:03] RECOVERY - Apache HTTP on mw1297 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.773 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:10:07] RECOVERY - PHP7 rendering on mw1285 is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 1.463 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:10:07] RECOVERY - Apache HTTP on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.993 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:10:09] RECOVERY - Apache HTTP on mw1381 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.198 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:10:09] RECOVERY - PHP7 rendering on mw1290 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.347 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:10:09] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:10:13] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:13] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:10:13] RECOVERY - PHP7 rendering on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.708 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:10:15] RECOVERY - PHP7 rendering on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:10:15] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:15] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:15] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:15] RECOVERY - PHP7 rendering on mw1360 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.803 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:10:15] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:15] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:10:17] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07576 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:10:17] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:10:20] RECOVERY - LVS api eqiad port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 24411 bytes in 1.844 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:10:20] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:10:25] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:25] RECOVERY - Apache HTTP on mw1359 is OK: HTTP OK: HTTP/1.1 302 Found - 642 bytes in 1.620 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:10:27] RECOVERY - Apache HTTP on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:10:35] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:43] RECOVERY - PHP7 rendering on mw1283 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:10:43] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:45] RECOVERY - Apache HTTP on mw1360 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.925 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:10:45] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:10:47] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 302 Found - 641 bytes in 0.706 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:10:47] RECOVERY - PHP7 rendering on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 655 bytes in 0.681 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:10:51] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 20 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:10:55] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:11:15] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6773 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [17:11:15] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:12:27] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:12:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 25%: Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P14824 and previous config saved to /var/cache/conftool/dbconfig/20210314-171240-root.json [17:12:47] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [17:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:49] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [17:13:41] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [17:13:42] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6175 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [17:13:55] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:14:15] RECOVERY - test.wikidata.org dispatch lag is REALLY high ---4000s- on test.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 2069 bytes in 0.066 second response time https://phabricator.wikimedia.org/project/view/71/ [17:14:36] (03PS1) 10Marostegui: Revert "db1146: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/671242 [17:15:25] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:15:28] (03CR) 10Marostegui: [C: 03+2] Revert "db1146: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/671242 (owner: 10Marostegui) [17:16:33] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:16:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:09] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [17:17:17] PROBLEM - SSH on mw2236.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:19:35] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [17:21:13] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:26:19] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07937 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:27:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 50%: Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P14825 and previous config saved to /var/cache/conftool/dbconfig/20210314-172744-root.json [17:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 75%: Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P14826 and previous config saved to /var/cache/conftool/dbconfig/20210314-174248-root.json [17:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 100%: Repool db1146:3314', diff saved to https://phabricator.wikimedia.org/P14827 and previous config saved to /var/cache/conftool/dbconfig/20210314-175751-root.json [17:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:32] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [18:18:04] RECOVERY - SSH on mw2236.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:20:05] 10SRE, 10Data-Persistence, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10RhinosF1) [18:20:34] akosiaris: put what I know and tagged a few relevant teams from what I saw in here and -sre [18:23:55] thanks [18:24:16] Np [18:31:45] 10SRE, 10Data-Persistence, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10Legoktm) > This also brought down any third party wiki using Instant Commons. The wikis actually went down? Or just couldn't serve new images or...? [18:32:16] 10SRE, 10Data-Persistence, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10Legoktm) [18:32:27] 10SRE, 10Data-Persistence, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10RhinosF1) >>! In T277417#6912136, @Legoktm wrote: >> This also brought down any third party wiki using Instant Commons. > > The wikis actually went down? Or just couldn't serve... [18:42:59] 10SRE, 10Data-Persistence, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10Marostegui) The database host that got overloaded was db1144, which serves the following groups: * contributions * recentchanges * recentchangeslinked * watchlist * logpager Tra... [18:55:59] 10SRE, 10Data-Persistence, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10RhinosF1) [18:57:12] legoktm: iirc, with instant commons it's the backend that does the request to wm servers rather than client [19:09:44] RhinosF1: what request? MediaWiki calls the API to get image sizes etc., but upload.wikimedia.org can be hotlinked [19:10:20] Nemo_bis: not sure tbh [19:10:28] I never looked into the issue [19:11:00] Because the last time it happened was the DDoS [19:11:38] The sole reason $wgUseInstantCommons is disabled by default is that we don't want to assume MediaWiki is supposed to access the internet [19:12:11] Yeah but many wikis of ours do have it on [19:12:30] Nope we don't use InstantCommons [19:12:43] Oh right I forgot I was the one who changed the default to hotlink ^_^ https://gerrit.wikimedia.org/r/c/mediawiki/core/+/251556/ [19:12:44] No I mean our wikis that I host [19:12:49] ah ok [19:12:54] Like third party [19:16:07] I hope you're not serving those pages uncached; API requests to Commons add up and the parsing gets super slow [19:16:46] At one point, API requests to Commons were the vast majority of the time spent to serve the wiki.wikimedia.it main page :/ (but we're going offtopic here) [19:17:30] There's nothing magic beyond instant commons being on but it's one of my follow ups [19:20:34] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [20:28:52] PROBLEM - Disk space on Hadoop worker on an-worker1107 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 21 GB (0% inode=99%): /var/lib/hadoop/data/e 16 GB (0% inode=99%): /var/lib/hadoop/data/m 26 GB (0% inode=99%): /var/lib/hadoop/data/g 24 GB (0% inode=99%): /var/lib/hadoop/data/c 24 GB (0% inode=99%): /var/lib/hadoop/data/b 24 GB (0% inode=99%): /var/lib/hadoop/data/i 26 GB (0% inode=99%): /var/lib/hadoop/data/l [20:28:52] 99%): /var/lib/hadoop/data/d 26 GB (0% inode=99%): /var/lib/hadoop/data/f 26 GB (0% inode=99%): /var/lib/hadoop/data/k 26 GB (0% inode=99%): /var/lib/hadoop/data/j 24 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [20:29:10] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [20:35:02] 10SRE, 10Performance-Team, 10Traffic: Enable webp thumbnails on all images for non-Commons wikis - https://phabricator.wikimedia.org/T269946 (10AntiCompositeNumber) [20:44:04] PROBLEM - Disk space on Hadoop worker on an-worker1115 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 25 GB (0% inode=99%): /var/lib/hadoop/data/c 25 GB (0% inode=99%): /var/lib/hadoop/data/b 26 GB (0% inode=99%): /var/lib/hadoop/data/j 23 GB (0% inode=99%): /var/lib/hadoop/data/m 24 GB (0% inode=99%): /var/lib/hadoop/data/g 25 GB (0% inode=99%): /var/lib/hadoop/data/k 24 GB (0% inode=99%): /var/lib/hadoop/data/i [20:44:04] 99%): /var/lib/hadoop/data/f 25 GB (0% inode=99%): /var/lib/hadoop/data/l 25 GB (0% inode=99%): /var/lib/hadoop/data/e 25 GB (0% inode=99%): /var/lib/hadoop/data/d 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [20:47:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:48:20] PROBLEM - Disk space on Hadoop worker on an-worker1115 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 25 GB (0% inode=99%): /var/lib/hadoop/data/c 25 GB (0% inode=99%): /var/lib/hadoop/data/b 27 GB (0% inode=99%): /var/lib/hadoop/data/j 24 GB (0% inode=99%): /var/lib/hadoop/data/m 25 GB (0% inode=99%): /var/lib/hadoop/data/g 25 GB (0% inode=99%): /var/lib/hadoop/data/k 24 GB (0% inode=99%): /var/lib/hadoop/data/i [20:48:20] 99%): /var/lib/hadoop/data/f 24 GB (0% inode=99%): /var/lib/hadoop/data/l 26 GB (0% inode=99%): /var/lib/hadoop/data/e 26 GB (0% inode=99%): /var/lib/hadoop/data/d 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [20:49:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:50:40] RECOVERY - Disk space on Hadoop worker on an-worker1115 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [20:52:48] RECOVERY - Disk space on Hadoop worker on an-worker1107 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:04:22] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [21:17:32] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [22:27:45] 10SRE, 10DBA, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10LSobanski) [23:08:24] RECOVERY - WDQS high update lag on wdqs1005 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.142e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [23:15:28] (03PS1) 10H.krishna123: [T277160] Added execution time for recover-dump, added unit tests [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/672150 [23:44:27] (03CR) 10Reedy: [C: 04-1] [T277160] Added execution time for recover-dump, added unit tests (033 comments) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/672150 (owner: 10H.krishna123)