[06:06:33] (PS1) Rxy: Add transwiki import sources in zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/592330 (https://phabricator.wikimedia.org/T250972)
[06:36:02] (CR) RhinosF1: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/592330 (https://phabricator.wikimedia.org/T250972) (owner: Rxy)
[08:10:20] Operations, Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (elukey) The codfw cluster is currently yellow, from explain I see a lot of `"explanation" : "node does not match index setting [index.routing.allocation.require] filters [disktype:\"hdd\"]"` I acked the alerts...
[08:11:09] cc: godog, herron, shdubsh --^
[08:16:28] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:25:40] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 22739 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:36:54] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:40:26] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22724 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:48:06] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:55:26] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22730 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[10:34:24] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[10:38:02] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22746 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:13:46] !log apply T250071 on s10 (labswiki)
[11:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:57] T250071: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071
[13:15:22] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:26:28] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.0 200 OK - 22751 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:24:06] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.618e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:25:22] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[14:26:12] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[14:26:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:27:44] ouch rx bandwidth saturation for mc1034 --^
[14:29:48] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 57 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:30:00] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[14:30:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:32:54] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:07:05] gonna restart wdqs1006.. is locked again
[15:08:12] !log depool and restart wdqs1006 to catch up with lag after deadlock T242453
[15:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:20] T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453
[15:08:55] 2.2 hours lagged :/
[15:12:28] !log cdanis@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=eqiad
[15:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:53] Operations, Traffic, Wikimedia-Apache-configuration: Stop advertising webmaster@wikimedia.org in apache configs - https://phabricator.wikimedia.org/T251005 (Reedy)
[18:13:30] (CR) Esanders: VisualEditor: Allow external link paste on officewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/591416 (owner: Esanders)
[18:13:35] (PS2) Esanders: VisualEditor: Allow external link paste on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/591416
[18:50:44] !log restart elasticsearch on logstash2020
[18:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:36] !log restart elasticsearch on logstash2021
[18:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:27] Operations, Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (colewhite) there was some work to rotate old indexes to spinning disks but the cluster knew of no nodes with the "hdd" disktype attribute. it looks like the configuration was stale and restarting logstash[2021...
[19:41:31] !log applying T114117 on labswiki (wikitech)
[19:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:39] T114117: Drop externallinks.el_from_namespace on wmf databases - https://phabricator.wikimedia.org/T114117
[19:59:54] (PS1) QEDK: Enable VisualEditor for more namespaces on vecwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/592427 (https://phabricator.wikimedia.org/T250419)
[20:01:02] PROBLEM - PHP opcache health on mw1407 is CRITICAL: CRITICAL: opcache full. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[20:14:08] (CR) Jhedden: [C: +1] Horizon: replace nutcracker with mcrouter [puppet] - https://gerrit.wikimedia.org/r/592276 (owner: Andrew Bogott)
[20:51:04] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[20:52:48] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[21:25:47] !log cdanis@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=eqiad
[21:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:38] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 86 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:43:28] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 35 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:04:44] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:28:14] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[23:28:58] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 23.28 ms
[23:32:37] (PS1) Dereckson: Prune JMT blog from fr.planet.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/592438 (https://phabricator.wikimedia.org/T251001)
[23:39:27] (PS1) Dereckson: Prune non existing domains from Planet [puppet] - https://gerrit.wikimedia.org/r/592439 (https://phabricator.wikimedia.org/T168459)