[00:00:12] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:20] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:26] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:36] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:44] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:09] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) Today even with {F31857474} all I got was {F31857476} Solution: Commons should simply use the same uploader as phabricator. [00:14:33] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) https://commons.wikimedia.org/w/index.php?title=Special:Upload&uploadformstyle=basic just gets up to 68% and says > Error > Our s... [00:16:59] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) Next attempt got to 80%, then > Request from - via cp5008.eqsin.wmnet, ATS/8.0.7 > Error: 502, Next Hop Connection Failed at 2020-06... 
[00:18:49] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) OK, now got to 92%, and > Request from - via cp5008.eqsin.wmnet, ATS/8.0.7 > Error: 502, Next Hop Connection Failed at 2020-06-08... [00:21:24] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) Nope, 92% is as far as one can get, before > Request from - via cp5008.eqsin.wmnet, ATS/8.0.7 > Error: 502, Next Hop Connection Fail... [00:35:08] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) > Upload warning > Copy uploads are not available from this domain. Great. [00:41:04] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Platonides) It doesn't make any sense that you can upload to phabricator, but not to commons. I would suspect some crazy with some intermedia... [00:41:38] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) OK finally succeeded via https://tools.wmflabs.org/url2commons/index.html copying from phabricator to https://commons.wikimedia.org/... [00:49:02] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) https://commons.wikimedia.beta.wmflabs.org/ : OK, created account... Uploading .... and of course at the very end... "None of the u... 
[01:02:30] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) I bet the form isn't even being sent. {F31857497} Is the biggest thing to make a curl of. And my network monitor doesn't show a lot... [01:07:23] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) > pppd default-asyncmap defaultroute lcp-echo-failure 7 lcp-echo-interval > 50 mtu 1492 noaccomp noauth noipdefault noproxyarp persi... [01:13:44] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) Anyway, today I was given IP address 36.234.68.20, so when you check the server errors logs above, you will see me. [01:20:57] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) OK, uploading the file to Twitter, lots of healthy yellow seen in the icewm network monitor {F31857502} Unlike when uploading to Co... 
[04:15:34] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6231 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:17:00] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 31.25 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [04:17:32] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [04:17:48] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:17:52] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [04:17:52] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 
format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [04:17:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_mobileapps_cluster_eqiad,swagger_check_restbase_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:17:58] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:18:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:18:04] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:18:12] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1079.eqiad.wmnet, cp1083.eqiad.wmnet, cp1089.eqiad.wmnet, cp1075.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down b [04:18:12] 6_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but 
pooled: textlb6_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:18:18] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:18:18] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:18:18] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [04:18:26] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [04:18:30] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:18:32] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [04:18:34] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:18:36] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [04:18:46] PROBLEM - wiki content on commons #page on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/1118/ [04:18:50] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [04:19:02] PROBLEM - PyBal backends health check on lvs1013 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1079.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1081.eqiad.wmnet, cp1079.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1089.eqiad.wmnet, cp1075.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: testlb6_4 [04:19:02] 1.eqiad.wmnet, cp1079.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:19:02] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [04:19:02] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqiad.wikimedia.org, port=443): Read timed out. (read timeout=15),): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [04:19:04] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:19:04] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [04:19:12] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:19:16] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [04:19:18] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 170 https://wikitech.wikimedia.org/wiki/Memcached 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:19:20] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 32.29 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:19:22] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:19:34] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [04:19:36] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:19:36] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:19:42] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:19:42] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out 
before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:19:42] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:19:48] PROBLEM - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:19:48] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:20:02] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1083 is OK: HTTP OK: HTTP/1.1 200 Ok - 32294 bytes in 0.446 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:20:06] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:20:06] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:20:26] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator [04:20:30] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:20:30] <_joe_> oh fuck [04:20:32] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:20:38] <_joe_> I just opened my eyes 
[04:20:43] yeah [04:20:58] <_joe_> this is a problem with memcached AFAICT [04:21:10] PROBLEM - PHP7 rendering on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:21:10] <_joe_> give me 1 minute to come to my senses and sit down and I'll look [04:21:20] ack [04:21:24] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:21:26] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:21:26] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:21:30] RECOVERY - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 549 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:21:30] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:21:34] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:21:34] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:21:34] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:21:34] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish 
[04:21:36] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16207 bytes in 0.510 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:21:36] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:21:45] I'm waking up too [04:21:46] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:21:47] hey [04:21:55] same [04:22:00] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: / [04:22:00] mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before [04:22:00] eceived: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase 
[04:22:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:22:08] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:22:08] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:22:12] RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 37204 bytes in 3.608 second response time https://wikitech.wikimedia.org/wiki/Phabricator [04:22:16] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:22:17] * shdubsh here [04:22:20] RECOVERY - wiki content on commons #page on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 172691 bytes in 0.042 second response time https://phabricator.wikimedia.org/project/view/1118/ [04:22:21] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:22:22] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:22:26] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:22:38] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Varnish [04:22:38] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:22:40] here [04:22:43] what's up [04:22:48] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:22:50] PROBLEM - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/71/ [04:22:52] <_joe_> ok, I'm looking at appservers as soon as I connect [04:22:52] RECOVERY - PHP7 rendering on mw1403 is OK: HTTP OK: HTTP/1.1 200 OK - 84213 bytes in 0.786 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:23:02] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.085e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:23:03] <_joe_> can someone be IC? 
[04:23:08] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:23:08] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:23:18] <_joe_> I am looking at mcrouter logs now [04:23:22] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:23:26] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:23:26] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:23:26] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:23:28] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:23:30] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:23:38] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Varnish [04:23:38] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:23:38] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:23:42] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [04:23:44] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:23:48] _joe_: I can [04:23:48] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:23:48] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:23:52] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:23:52] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:23:54] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [04:23:58] PROBLEM - Varnish HTTP 
text-frontend - port 3122 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:23:58] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:23:58] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:24:04] <_joe_> mc1029.eqiad.wmnet is the problem [04:24:20] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:24:24] PROBLEM - ATS TLS has reduced HTTP availability #page on icinga1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [04:24:38] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [04:24:40] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:24:45] <_joe_> enwiki:pcache:idhash:41768916-0!canonical [04:24:56] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:24:56] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:25:12] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:25:12] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Varnish [04:25:13] <_joe_> no idea what that is but I'm thinking of tearing down memcache on this host [04:25:24] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:25:24] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:25:32] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:25:34] <_joe_> 304k sized key [04:25:40] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:25:40] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:25:42] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:25:52] gutter pool shows a lot of activity about the time of the first page [04:25:56] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:26:00] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:26:00] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:26:27] <_joe_> shdubsh: it seems we're somewhat below when things get completely moved to the gutter [04:26:32] PROBLEM - Varnish HTTP text-frontend - 
port 3124 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:26:42] <_joe_> so I'm going to firewall off mc1029 now [04:26:52] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:27:10] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:27:10] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:27:15] PROBLEM - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:27:18] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:27:20] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:27:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:27:20] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:27:28] PROBLEM - Varnish HTTP 
text-frontend - port 3122 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:27:28] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:27:34] <_joe_> !log firewalling off memcached on mc1029 [04:27:46] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:28:13] Reports in #wikipedia-en of enwiki being slow - assuming it's the varnish issues above? [04:28:35] _joe_: Failed to log message to wiki. Somebody should check the error logs. [04:29:00] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:29:00] PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:29:02] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:29:02] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:29:08] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16226 bytes in 5.573 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:29:10] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:29:10] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Varnish [04:29:18] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:29:26] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:29:28] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:29:32] <_joe_> let's see if this improves things [04:29:44] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [04:29:46] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator [04:29:48] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:29:50] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:29:54] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:29:55] PROBLEM - wiki content on commons #page on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/1118/ [04:29:58] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 3.887 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:30:02] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Varnish [04:30:06] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:30:16] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [04:30:20] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqiad.wikimedia.org, port=443): Read timed out. (read timeout=15),): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [04:30:44] RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 84129 bytes in 1.261 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:30:46] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:30:46] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:31:03] PROBLEM - LibreNMS has a critical alert #page on icinga1001 is CRITICAL: Primary inbound port utilisation over 80% #page (asw2-b-eqiad.mgmt.eqiad.wmnet) // Primary outbound port utilisation over 80% #page (cr1-eqiad.wikimedia.org) https://wikitech.wikimedia.org/wiki/Network_monitoring%23LibreNMS_alerts [04:31:10] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:31:10] RECOVERY - Varnish HTTP text-frontend - port 
80 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:31:16] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [04:31:32] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:31:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:31:54] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:32:18] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 216 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:32:26] PROBLEM - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is CRITICAL: CRITICAL - exception while fetching the URL. 
502 Server Error: Next Hop Connection Failed for url: https://en.planet.wikimedia.org/ https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [04:32:46] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:33:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:33:00] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16213 bytes in 6.584 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:33:20] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:33:40] PROBLEM - Memcached on mc1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [04:34:36] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:34:48] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:34:50] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps 
[04:35:08] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:35:12] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:35:12] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:35:22] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:35:22] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:35:24] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/media-list/{title [04:35:24] from storage) is CRITICAL: Test Get media-list from storage returned the unexpected status 502 (expecting: 200): /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was recei [04:35:24] /page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a response was 
received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the [04:35:24] 502 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [04:35:42] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:36:02] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [04:36:24] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:36:26] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:36:26] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 7.028 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:28] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 7.703 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:28] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:36:34] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 541 bytes in 2.342 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:34] RECOVERY - Varnish HTTP 
text-frontend - port 3122 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:34] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:36] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:36:36] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:36] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.303 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:36] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:36:38] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:40] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:46] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:46] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:46] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish 
[04:36:46] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:46] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:47] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:48] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:36:48] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:48] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:58] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:36:58] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:00] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:02] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [04:37:02] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish 
[04:37:02] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:02] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [04:37:04] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:04] RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 37204 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Phabricator [04:37:06] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [04:37:08] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:37:12] RECOVERY - wiki content on commons #page on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 172692 bytes in 0.013 second response time https://phabricator.wikimedia.org/project/view/1118/ [04:37:20] RECOVERY - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is OK: OK - Website content is current (855 = 86400) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [04:37:20] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:20] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:20] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time 
https://wikitech.wikimedia.org/wiki/Varnish [04:37:22] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:22] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:22] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:26] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:26] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [04:37:34] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:34] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:34] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:37:36] RECOVERY - PyBal backends health check on lvs1013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:37:38] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [04:37:38] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:40] RECOVERY - Varnish HTTP 
text-frontend - port 3124 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:48] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:37:52] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:52] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:37:54] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:38:08] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [04:38:10] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:38:10] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:38:14] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:38:14] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:38:14] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:38:14] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:38:14] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:38:14] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:38:18] RECOVERY - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:38:22] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [04:38:22] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [04:38:22] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:38:22] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:38:23] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16226 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:38:28] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:38:28] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [04:38:42] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:38:44] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:38:44] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:38:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:38:46] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [04:38:50] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:38:52] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:38:52] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:38:54] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:39:08] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:39:08] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:39:10] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:39:12] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:39:12] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:39:12] RECOVERY - Varnish HTTP 
text-frontend - port 3120 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:39:18] RECOVERY - ATS TLS has reduced HTTP availability #page on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [04:39:28] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [04:39:36] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:39:36] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:39:54] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [04:39:56] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [04:40:14] RECOVERY - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1759 bytes in 0.068 second response time https://phabricator.wikimedia.org/project/view/71/ [04:41:38] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 94.07 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:43:00] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.0125 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [04:43:34] (03PS2) 10Catrope: Enable GrowthExperiments guidance everywhere behind feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601409 (https://phabricator.wikimedia.org/T253794) (owner: 10Gergő Tisza) [04:44:02] RECOVERY - LibreNMS has a critical alert #page on icinga1001 is OK: OK: zero critical LibreNMS alerts https://wikitech.wikimedia.org/wiki/Network_monitoring%23LibreNMS_alerts [04:45:36] <_joe_> !log de-firewalling mc1029 [04:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:26] RECOVERY - Memcached on mc1029 is OK: TCP OK - 0.001 second response time on 10.64.32.209 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [05:02:06] (03PS1) 10KartikMistry: Update cxserver 
to 2020-06-08-045500-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/603141 (https://phabricator.wikimedia.org/T246319) [05:08:44] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 62 probes of 579 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:14:34] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 579 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:20:26] (03PS1) 10Vgutierrez: ATS: Increase ats-backend max connections and max active connections [puppet] - 10https://gerrit.wikimedia.org/r/603158 [05:22:49] !log Upgrade db1077 to 10.4.13 to test events memory leak [05:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:47] (03PS1) 10DannyS712: Remove TranslationNotifications user settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780) [06:00:16] (03PS2) 10DannyS712: Remove TranslationNotifications user settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780) [06:05:50] (03PS1) 10Abijeet Patro: TranslationNotifications: Remove username / password for sending messages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603169 (https://phabricator.wikimedia.org/T144780) [06:06:16] (03CR) 10DannyS712: "Dupe of https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/603167/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603169 (https://phabricator.wikimedia.org/T144780) (owner: 10Abijeet Patro) [06:07:22] (03Abandoned) 10Abijeet Patro: TranslationNotifications: Remove username / password for sending messages 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/603169 (https://phabricator.wikimedia.org/T144780) (owner: 10Abijeet Patro) [06:24:03] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10elukey) I am all for testing new versions of memcached to get experience, so on this front you'll always have my +1 :) Upstream is also very available to help and give feedba... [06:29:04] PROBLEM - Check systemd state on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:32] PROBLEM - MD RAID on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:18] PROBLEM - ores uWSGI web app on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:24] PROBLEM - DPKG on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:24] PROBLEM - Check size of conntrack table on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:26] this is the lovely celery oom, running puppet --^ [06:30:52] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:14] RECOVERY - Check size of conntrack table on ores1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:35:17] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - 
https://phabricator.wikimedia.org/T242705 (10elukey) @Halfak have we reported the issue to upstream asking for an advice (https://github.com/unbit/uwsgi) ?... [06:40:20] RECOVERY - MD RAID on ores1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:45:07] 10Operations, 10SRE-tools, 10Continuous-Integration-Config, 10Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494 (10hashar) 05duplicate→03Open Reopening since this is about adding shellcheck on any repository while T254480 is specific to puppet.git and covers... [06:52:45] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10hashar) The CI container is build using Buster which co... 
[07:01:12] RECOVERY - DPKG on ores1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [07:02:20] (03CR) 10Kormat: [C: 03+1] textfile exporter for php-fpm worker pool status [puppet] - 10https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis) [07:05:35] !log Stop MySQL on labsdb1012 to clone labsdb1011 T249188 [07:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:39] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 [07:10:20] (03PS2) 10Dzahn: admin: shell account for Yi-Ju Lu, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/602756 (https://phabricator.wikimedia.org/T254130) [07:10:29] (03CR) 10Dzahn: [C: 03+2] admin: shell account for Yi-Ju Lu, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/602756 (https://phabricator.wikimedia.org/T254130) (owner: 10Dzahn) [07:20:37] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) 05Open→03Resolved Hello YiJuLu, your shell account has been created now. Here is some more information about the SSH config you will need to... [07:21:13] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) cc: @diego done! [07:23:52] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) @YiJuLu Keep in mind your shell user is "lulu". 
[07:26:13] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10YiJuLu) @Dzahn Thanks a lot :) [07:27:37] (03CR) 10Dzahn: admin: create shell user for Daniel Cipoletti (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) (owner: 10Dzahn) [07:27:40] (03PS1) 10Marostegui: install_server: Allow reimage db1141 [puppet] - 10https://gerrit.wikimedia.org/r/603338 (https://phabricator.wikimedia.org/T252512) [07:27:53] (03PS2) 10Dzahn: admin: create shell user for Daniel Cipoletti [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) [07:28:37] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage db1141 [puppet] - 10https://gerrit.wikimedia.org/r/603338 (https://phabricator.wikimedia.org/T252512) (owner: 10Marostegui) [07:30:01] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10hashar) >>! In T254491#6196244, @jbond wrote: > i have made a first pass at the [[ https://wikitech.wikimedia.org/wiki/Incident_documentatio... 
[07:31:29] (03CR) 10Ayounsi: BGP: add transit links (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/602119 (https://phabricator.wikimedia.org/T250136) (owner: 10Ayounsi) [07:37:07] (03CR) 10Muehlenhoff: admin: create shell user for Daniel Cipoletti (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) (owner: 10Dzahn) [07:37:12] !log installing nodejs security updates [07:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:17] !log cr3-ulsfo protocols bgp group Transit4 family inet any -> unicast - T250136 [07:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:21] T250136: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 [07:41:33] !og restarting turnilo for nodejs security update [07:42:38] !log cr4-ulsfo protocols bgp group Transit4 family inet any -> unicast - T250136 [07:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:38] (03PS3) 10Dzahn: admin: create shell user for Daniel Cipoletti [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) [07:45:57] (03CR) 10Dzahn: admin: create shell user for Daniel Cipoletti (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) (owner: 10Dzahn) [07:46:43] !log push T250136 to esams/knams - T250136 [07:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:47] T250136: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 [07:46:51] (03CR) 10Muehlenhoff: [C: 03+1] admin: create shell user for Daniel Cipoletti [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) (owner: 10Dzahn) [07:47:51] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Daniel Cipoletti to analytics-privatedata-users - 
https://phabricator.wikimedia.org/T253086 (10Dzahn) Thanks @dr0ptp4kt . Created Kerberos user [https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_principal_for_a_real_user] @... [07:48:15] (03PS4) 10Dzahn: admin: create shell user for Daniel Cipoletti [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) [07:48:54] (03CR) 10Dzahn: [C: 03+2] admin: create shell user for Daniel Cipoletti [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) (owner: 10Dzahn) [07:50:10] 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10Dzahn) a:05Ferdi2005→03None [07:50:54] (03CR) 10JMeybohm: [C: 03+1] rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [07:54:30] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:56:02] 10Operations, 10Traffic: ats-backend throttles connections under heavy load - https://phabricator.wikimedia.org/T254714 (10Vgutierrez) [07:57:08] !log ran puppet on all stat* hosts for an access request (dcipoletti was added) - stat1006 systemd state broke right after, jupyter-dedcode-singleuser.service failed [07:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:03] (03PS2) 10Vgutierrez: ATS: Increase ats-backend max connections and max active connections [puppet] - 10https://gerrit.wikimedia.org/r/603158 (https://phabricator.wikimedia.org/T254714) [07:58:08] ! 
stat1006 stat1006 bash[40607]: /bin/bash: line 0: exec: jupyterhub-singleuser: not found [07:58:17] !log stat1006 bash[40607]: /bin/bash: line 0: exec: jupyterhub-singleuser: not found [07:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:33] !log push T250136 to eqord/eqdfw - T250136 [07:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:36] T250136: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 [07:59:07] (03CR) 10Ema: [C: 03+1] ATS: Increase ats-backend max connections and max active connections [puppet] - 10https://gerrit.wikimedia.org/r/603158 (https://phabricator.wikimedia.org/T254714) (owner: 10Vgutierrez) [07:59:13] 10Operations, 10Phabricator, 10Security-Team, 10Traffic: Accessing Phabricator from Tor - https://phabricator.wikimedia.org/T254568 (10Fae) Playing with the Tor browser this morning, a work-around could be to for users to keep trying new Tor circuits until they stop getting the Error 500 message. This appe... 
[08:02:15] !log push T250136 to codfw - T250136 [08:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:12] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable Thanos upload for services [puppet] - 10https://gerrit.wikimedia.org/r/602716 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [08:03:53] (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: bail on check_command containing newlines [puppet] - 10https://gerrit.wikimedia.org/r/602669 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [08:04:02] (03PS2) 10Filippo Giunchedi: monitoring: bail on check_command containing newlines [puppet] - 10https://gerrit.wikimedia.org/r/602669 (https://phabricator.wikimedia.org/T252186) [08:06:16] (03CR) 10Vgutierrez: [C: 03+2] ATS: Increase ats-backend max connections and max active connections [puppet] - 10https://gerrit.wikimedia.org/r/603158 (https://phabricator.wikimedia.org/T254714) (owner: 10Vgutierrez) [08:06:48] vgutierrez: merging your change too [08:06:49] godog: feel free to merge mine if you're seeing a multiple warning :) [08:06:53] yeah that :D [08:06:57] thanks! [08:07:04] good timing :D [08:07:06] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:07] !log stat1006 moved broken jupyter-dedcode-singleuser.service out of /run/systemd/transient. systemctl reset-failed [08:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:43] !log upgrading mw1349-mw1383 to PHP 7.2.31 [08:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:52] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10Dzahn) 05Open→03Resolved @dcipoletti Your shell account has been created now. 
Here is some more information about the SSH config you will need to j... [08:09:37] !log push T250136 to eqiad - T250136 [08:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:40] T250136: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 [08:17:36] !log push T250136 to eqsin - T250136 [08:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:40] T250136: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 [08:20:09] (03PS2) 10Filippo Giunchedi: thanos: add alerts for Thanos components [puppet] - 10https://gerrit.wikimedia.org/r/602633 (https://phabricator.wikimedia.org/T252186) [08:20:11] (03PS2) 10Filippo Giunchedi: prometheus: enable Thanos upload for services [puppet] - 10https://gerrit.wikimedia.org/r/602716 (https://phabricator.wikimedia.org/T252186) [08:20:13] (03PS5) 10Filippo Giunchedi: prometheus: merge ops instance role into profile [puppet] - 10https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186) [08:20:15] (03PS2) 10Filippo Giunchedi: prometheus: enable Thanos upload for k8s [puppet] - 10https://gerrit.wikimedia.org/r/602715 (https://phabricator.wikimedia.org/T252186) [08:20:17] (03PS2) 10Filippo Giunchedi: prometheus: enable Thanos upload for ops in esams [puppet] - 10https://gerrit.wikimedia.org/r/602717 (https://phabricator.wikimedia.org/T252186) [08:20:55] (03CR) 10Filippo Giunchedi: thanos: add alerts for Thanos components (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602633 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [08:21:22] (03CR) 10Ayounsi: [C: 03+2] BGP: add transit links [homer/public] - 10https://gerrit.wikimedia.org/r/602119 (https://phabricator.wikimedia.org/T250136) (owner: 10Ayounsi) [08:21:36] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add alerts for Thanos components [puppet] - 10https://gerrit.wikimedia.org/r/602633 
(https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [08:22:52] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, and 2 others: ATS or Varnish incorrectly strips Content-Disposition header for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles) a:03ema [08:30:20] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 57 probes of 661 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:32:37] (03PS1) 10Ayounsi: Remove unused or outdated esams AS-specific policy-statements [homer/public] - 10https://gerrit.wikimedia.org/r/603363 (https://phabricator.wikimedia.org/T250136) [08:32:47] (03PS1) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) [08:32:58] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Browser-Support-Firefox: Swift doesn't save or regenerate Content-Disposition: inline for thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles) p:05Triage→03Medium a:05ema→03Gilles [08:33:10] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable Thanos upload for services [puppet] - 10https://gerrit.wikimedia.org/r/602716 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [08:33:23] (03PS1) 10RhinosF1: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603365 [08:33:56] (03CR) 10jerkins-bot: [V: 04-1] phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [08:35:30] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 8 probes of 661 (alerts on 35) - 
https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:36:38] PROBLEM - Thanos sidecar cannot connect to Prometheus on icinga1001 is CRITICAL: cluster={bastion,prometheus} instance={bast3004:19900,bast4002:19900,bast5001:19900,prometheus1003:19900,prometheus1003:19903,prometheus1003:19905,prometheus1003:19906,prometheus1003:19907,prometheus1004:19900,prometheus1004:19903,prometheus1004:19905,prometheus1004:19906,prometheus1004:19907,prometheus2003:19900,prometheus2003:19903,prometheus2003:1 [08:36:38] 03:19906,prometheus2004:19900,prometheus2004:19903,prometheus2004:19905,prometheus2004:19906} job=thanos-sidecar prometheus=ops site={codfw,eqiad,eqsin,esams,ulsfo} https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [08:37:14] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Browser-Support-Firefox: Swift doesn't save Content-Disposition: inline for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles) [08:38:11] (03PS2) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) [08:38:38] the thanos alert is me [08:38:52] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Browser-Support-Firefox: Swift doesn't save Content-Disposition: inline for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles) By poking at objects stored in Swift I've been able to establish that jpg thumbnails have t... 
[08:40:04] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Browser-Support-Firefox: Thumbor doesn't save Content-Disposition: inline headers to Swift for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles) [08:40:06] (03PS2) 10RhinosF1: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603365 [08:40:14] (03CR) 10Ayounsi: [C: 03+2] "The only provider we actually do something with the communities (Init7) don't use their currently valid communities https://www.as13030.ne" [homer/public] - 10https://gerrit.wikimedia.org/r/603363 (https://phabricator.wikimedia.org/T250136) (owner: 10Ayounsi) [08:40:19] (03PS3) 10RhinosF1: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603365 (https://phabricator.wikimedia.org/T254012) [08:40:39] (03Merged) 10jenkins-bot: Remove unused or outdated esams AS-specific policy-statements [homer/public] - 10https://gerrit.wikimedia.org/r/603363 (https://phabricator.wikimedia.org/T250136) (owner: 10Ayounsi) [08:40:45] (03PS1) 10Elukey: Switch backend for piwik.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/603366 (https://phabricator.wikimedia.org/T252740) [08:42:58] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Browser-Support-Firefox: Thumbor doesn't save Content-Disposition: inline headers to Swift for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles) The same might actually be true for all thumbnails, but might be masked... 
[08:44:02] PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.421e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [08:44:26] PROBLEM - Prometheus prometheus1003/services restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9903 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services [08:45:02] known ^ [08:47:57] (03PS1) 10Elukey: matomo: remove unnecessary plugin [puppet] - 10https://gerrit.wikimedia.org/r/603369 (https://phabricator.wikimedia.org/T252740) [08:48:38] PROBLEM - Prometheus prometheus2004/services restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9903 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [08:49:46] PROBLEM - Prometheus prometheus2003/services restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9903 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [08:52:30] (03CR) 10RhinosF1: [C: 04-1] "hold for on-task discussion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603365 (https://phabricator.wikimedia.org/T254012) (owner: 10RhinosF1) [08:52:36] (03CR) 10Elukey: [C: 03+2] matomo: remove unnecessary plugin [puppet] - 10https://gerrit.wikimedia.org/r/603369 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey) [08:52:59] (03CR) 10Dzahn: "currently I don't see why the script content seems to be empty in compiler: https://puppet-compiler.wmflabs.org/compiler1001/23051/phab100" 
[puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [08:53:54] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Aklapper) @Jidanni: Could you please not create 12 comments in 70 minutes, but instead first run tests and then at the end properly summarize... [08:57:10] RECOVERY - Thanos compact has not run on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:00:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:04:56] PROBLEM - Thanos compact has disappeared from Prometheus discovery on icinga1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [09:06:42] RECOVERY - Prometheus prometheus1003/services restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services [09:06:46] RECOVERY - Thanos compact has disappeared from Prometheus discovery on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [09:07:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:08:20] 10Operations, 10netops: Telia eqiad<->codfw (IC-307235) outage ref: 01171084 - https://phabricator.wikimedia.org/T254674 (10Dzahn) "we experienced a brief service disruption in a card in Selma, AL, impacting our transmission stretch between Atlanta and Houston. A cold reboot of the card restored service. We... [09:11:38] RECOVERY - Prometheus prometheus2004/services restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [09:12:53] RECOVERY - Prometheus prometheus2003/services restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [09:13:40] PROBLEM - More than one Thanos compact running on icinga1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:15:18] PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.421e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:24:24] (03CR) 10Volans: "Nice! Couple of minor things inline, all the rest are optional nits." (0311 comments) [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/600295 (owner: 10Elukey) [09:26:12] RECOVERY - Thanos compact has not run on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:26:22] RECOVERY - More than one Thanos compact running on icinga1001 is OK: (C)1 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:29:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:30:18] ooof sorry for the alert spam [09:30:44] (03PS4) 10Muehlenhoff: Extend Cassandra cookbook to also cover maps [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 [09:31:15] (03CR) 10Muehlenhoff: Extend Cassandra cookbook to also cover maps (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 (owner: 10Muehlenhoff) [09:32:28] !log Turning on puppet on gerrit1002 again to avoid starting to lag too far behind [09:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:03] (03CR) 10jerkins-bot: [V: 04-1] Extend Cassandra cookbook to also cover maps [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 (owner: 10Muehlenhoff) [09:33:18] qchris: 👍 [09:33:43] 10Operations, 10netops: Telia eqiad<->codfw (IC-307235) outage ref: 01171084 - https://phabricator.wikimedia.org/T254674 (10ayounsi) 05Open→03Resolved a:03ayounsi Thanks, all back to normal. 
[09:34:02] PROBLEM - Thanos compact has disappeared from Prometheus discovery on icinga1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [09:34:24] :-) [09:34:41] (03PS5) 10Muehlenhoff: Extend Cassandra cookbook to also cover maps [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 [09:38:53] (03PS1) 10Gilles: Store Content-Disposition header in Swift [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/603386 (https://phabricator.wikimedia.org/T254557) [09:39:20] (03CR) 10jerkins-bot: [V: 04-1] Store Content-Disposition header in Swift [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/603386 (https://phabricator.wikimedia.org/T254557) (owner: 10Gilles) [09:40:39] (03CR) 10Volans: "Minor nits inline and a question/suggestion." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (https://phabricator.wikimedia.org/T252807) (owner: 10Muehlenhoff) [09:41:18] RECOVERY - Thanos compact has disappeared from Prometheus discovery on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [09:41:42] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:44:25] (03CR) 10Jbond: [C: 03+1] "> Patch Set 2:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [09:45:41] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10jbond) >>! 
In T254480#6198075, @ArielGlenn wrote: > @jb... [09:46:53] !log installing gnutls28 security updates on buster (older releases not affected) [09:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:54] PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.421e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:48:06] PROBLEM - More than one Thanos compact running on icinga1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:48:19] (03PS1) 10Elukey: matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) [09:49:26] (03CR) 10jerkins-bot: [V: 04-1] matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey) [09:49:42] RECOVERY - Thanos compact has not run on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:49:52] RECOVERY - More than one Thanos compact running on icinga1001 is OK: (C)1 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:51:57] (03PS2) 10Elukey: matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) [09:53:04] (03CR) 10jerkins-bot: [V: 04-1] matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey) [10:00:42] (03PS1) 10Filippo Giunchedi: swift: enable bulk and slo middlewares for s3api compat [puppet] - 10https://gerrit.wikimedia.org/r/603394 (https://phabricator.wikimedia.org/T252186) [10:00:42] PROBLEM - More than one Thanos compact running on icinga1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [10:08:15] (03CR) 10Jbond: puppet-merge: fix shellcheck issues (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602738 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [10:08:35] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/23053/" [puppet] - 10https://gerrit.wikimedia.org/r/603394 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [10:08:43] (03PS2) 10Filippo Giunchedi: swift: enable bulk and slo middlewares for s3api compat [puppet] - 10https://gerrit.wikimedia.org/r/603394 (https://phabricator.wikimedia.org/T252186) [10:08:52] (03PS3) 10Jbond: puppet-merge: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602738 (https://phabricator.wikimedia.org/T254480) [10:10:40] (03CR) 10Jbond: puppet-merge: split dynamic values out of puppet-merge 
script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [10:10:46] (03PS3) 10Jbond: puppet-merge: split dynamic values out of puppet-merge script [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) [10:12:06] (03CR) 10Jbond: [C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/602649 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [10:13:57] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, and 2 others: Thumbor doesn't save Content-Disposition: inline headers to Swift for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10TheDJ) Good detective work @gilles ! [10:14:16] (03CR) 10Jbond: [C: 03+1] wmflib: add systemd.timer OnCalendar support to cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/600928 (https://phabricator.wikimedia.org/T210818) (owner: 10Cwhite) [10:15:04] (03CR) 10Volans: [C: 04-1] "I see one issue with the way the current "cluster" is defined, see inline for the details." 
(033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 (owner: 10Muehlenhoff) [10:23:54] (03PS3) 10Elukey: matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) [10:25:00] (03CR) 10jerkins-bot: [V: 04-1] matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey) [10:27:05] (03CR) 10Volans: [C: 03+1] "LGTM, couple of optional nits inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/598021 (owner: 10Jbond) [10:27:12] (03PS4) 10Elukey: matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) [10:28:04] 10Operations, 10Commons: Incorrect information in category table for commonswiki - https://phabricator.wikimedia.org/T254734 (10Base) [10:30:05] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T1030). 
[10:30:35] (03PS1) 10Dzahn: wikistats (cloud): update query for XML dump of Wikipedia table [puppet] - 10https://gerrit.wikimedia.org/r/603407 (https://phabricator.wikimedia.org/T254214) [10:31:33] (03CR) 10Dzahn: [C: 03+2] wikistats (cloud): update query for XML dump of Wikipedia table [puppet] - 10https://gerrit.wikimedia.org/r/603407 (https://phabricator.wikimedia.org/T254214) (owner: 10Dzahn) [10:31:41] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603408 (https://phabricator.wikimedia.org/T128546) [10:32:57] (03PS1) 10Ayounsi: Depool codfw for routre upgrade [dns] - 10https://gerrit.wikimedia.org/r/603409 (https://phabricator.wikimedia.org/T243080) [10:33:13] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603408 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:34:05] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603408 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:37:46] (03PS2) 10Ayounsi: Depool codfw for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/603409 (https://phabricator.wikimedia.org/T243080) [10:38:39] (03CR) 10Vgutierrez: [C: 03+1] Depool codfw for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/603409 (https://phabricator.wikimedia.org/T243080) (owner: 10Ayounsi) [10:38:59] (03CR) 10Ayounsi: [C: 03+2] Depool codfw for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/603409 (https://phabricator.wikimedia.org/T243080) (owner: 10Ayounsi) [10:39:03] (03PS5) 10Elukey: matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) [10:39:17] !log depool codfw - T243080 [10:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:35] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: 
Wikimedia Portals Update: [[gerrit:603408| Bumping portals to master (603408)]] (duration: 01m 09s) [10:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:33] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:603408| Bumping portals to master (603408)]] (duration: 00m 57s) [10:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:53] !log bump all cr1-codfw OSPF metrics - T243080 [10:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:24] !log deactivate cr1-codfw transit/peering - T243080 [10:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:07] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Apache on RT [puppet] - 10https://gerrit.wikimedia.org/r/603412 (https://phabricator.wikimedia.org/T135991) [10:46:14] !log install Junos on cr1-codfw:re1 (backup) - T243080 [10:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:51] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Apache on peopleweb [puppet] - 10https://gerrit.wikimedia.org/r/603414 (https://phabricator.wikimedia.org/T135991) [10:50:23] (03CR) 10Dzahn: [C: 03+2] Enable base::service_auto_restart for Apache on RT [puppet] - 10https://gerrit.wikimedia.org/r/603412 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:51:36] (03CR) 10Volans: "Looks sane, although I'm not familiar with the PDU interface to say if that part is correct or not. Minor things inline." 
(036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [10:51:41] !log reboot cr1-codfw:re1 (backup) - T243080 [10:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:35] (03CR) 10Dzahn: [C: 03+2] Enable base::service_auto_restart for Apache on peopleweb [puppet] - 10https://gerrit.wikimedia.org/r/603414 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:56:57] !log do cr1-codfw RE mastership switch - T243080 [10:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:26] (03CR) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [10:59:03] waiting for linecards to reboot [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T1100). [11:00:05] tgr and lucaswerkmeister: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:19] o/ [11:00:26] o/ [11:00:47] lucaswerkmeister is about to join us, his IRC client is being slow for some reason [11:00:52] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:00:56] tgr: do you want to deploy yourself or should I do it? [11:01:03] (03CR) 10Volans: [C: 04-1] "Looks mostly ok to me. I can't assure all the junos commands are correct though. One error, see inline." 
(033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/596444 (owner: 10Ayounsi) [11:01:16] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:02:43] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:02:55] Lucas_WMDE: I'll leave it to you, thanks! [11:03:04] ok! [11:03:06] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:03:23] (03PS3) 10Lucas Werkmeister (WMDE): Enable GrowthExperiments guidance everywhere behind feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601409 (https://phabricator.wikimedia.org/T253794) (owner: 10Gergő Tisza) [11:03:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601409 (https://phabricator.wikimedia.org/T253794) (owner: 10Gergő Tisza) [11:04:24] (03Merged) 10jenkins-bot: Enable GrowthExperiments guidance everywhere behind feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601409 (https://phabricator.wikimedia.org/T253794) (owner: 10Gergő Tisza) [11:04:25] apparently lucaswerkmeister isn’t allowed to join this channel :S [11:04:41] so I guess I’ll do both halves of the deployment from this nickname, meh [11:04:53] (it’s a volunteer change that I’m SWAT deploying as WMDE staff) [11:04:59] anyways, tgr first [11:05:00] !log install Junos on cr1-codfw:re0 (backup) - T243080 [11:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:29] tgr: change is on mwdebug1001, can you test it? 
[11:05:34] !log Install events on es1 T254689 [11:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:38] T254689: Check that all core hosts have events installed and enabled - https://phabricator.wikimedia.org/T254689 [11:05:45] the channel should be open to anyone as long as they are registered to services [11:05:55] testing [11:06:10] I thought that nick was registered but I guess I’m wrong [11:07:12] Lucas_WMDE: it is, you just haven't identified to it [11:07:39] From its client do "/msg NickServ identify yourpassword" [11:08:06] Lucas_WMDE: it works, thanks! [11:08:36] ok, syncing [11:09:41] o/ [11:09:56] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:601409|Enable GrowthExperiments guidance everywhere behind feature flag (T253794)]] (duration: 00m 57s) [11:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:00] T253794: Newcomer tasks: hidden preference for guidance - https://phabricator.wikimedia.org/T253794 [11:10:00] thanks RhinosF1, that fixed it [11:10:13] :) [11:10:28] (03PS2) 10Lucas Werkmeister (WMDE): Remove Wikibase idBlacklist setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602981 (https://phabricator.wikimedia.org/T254686) (owner: 10Lucas Werkmeister) [11:10:46] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602981 (https://phabricator.wikimedia.org/T254686) (owner: 10Lucas Werkmeister) [11:11:01] !log reboot cr1-codfw:re0 (backup) - T243080 [11:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:38] (03Merged) 10jenkins-bot: Remove Wikibase idBlacklist setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602981 (https://phabricator.wikimedia.org/T254686) (owner: 10Lucas Werkmeister) [11:12:09] lucaswerkmeister: change is on mwdebug1001, please test [11:12:12] ok [11:12:18] let me find an example lexeme
to create [11:12:52] created https://www.wikidata.org/wiki/Lexeme:L301969, ID looks normal [11:13:01] looks like everything’s still working [11:13:06] ok, syncing [11:13:13] (I’ll fill in the rest of the lexeme later) [11:13:34] I guess I’ll first sync Wikibase.php, so that wmgWikibaseIdBlacklist is no longer read [11:13:39] and then IS.php, so it’s no longer set either [11:13:41] I think that’s the safe order [11:13:46] and better than syncing all of wmf-config/ at once [11:15:16] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:602981|Remove Wikibase idBlacklist setting (T254686)]], part 1 (duration: 00m 56s) [11:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:20] T254686: Rename WikibaseRepo’s idBlacklist setting - https://phabricator.wikimedia.org/T254686 [11:15:26] !log cr1-codfw> request chassis routing-engine master switch - T243080 [11:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:23] PROBLEM - Host pfw3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [11:16:35] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:602981|Remove Wikibase idBlacklist setting (T254686)]], part 2 (duration: 00m 56s) [11:16:35] XioNoX: related? ^^^ [11:16:40] to the ongoing work [11:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:51] volans: shouldn't but probably [11:16:59] RECOVERY - Host pfw3-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [11:17:07] was there user impact? [11:17:11] nop [11:17:26] paged [11:17:31] I have acked it [11:18:00] I think monitoring lost a few pings [11:18:00] me too [11:18:08] I think that’s it for the EU SWAT [11:18:22] marostegui: something's weird, you ack'ed the recovery [11:18:30] shouldn't that recover? why need to be acked? 
[11:18:33] oh [11:18:33] <_joe_> so victorops didn't see the recovery as the recovery [11:18:34] I hope those codfw messages aren’t SWAT-related? [11:18:38] yeah [11:18:39] weird [11:18:39] <_joe_> Lucas_WMDE: no [11:18:41] weird [11:18:42] there is this: https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&from=1591604312681&to=1591615112681 [11:18:55] !log EU SWAT done [11:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:58] <_joe_> please someone report this to the observability team [11:19:07] I'll take care of it [11:19:20] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 64, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:19:20] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:19:36] (03PS1) 10Marostegui: db1141: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/603419 (https://phabricator.wikimedia.org/T252512) [11:19:48] I'm staying focused on my upgrade [11:19:59] but will look at the pfw3-codfw afterwards [11:20:32] (03CR) 10Marostegui: [C: 03+2] db1141: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/603419 (https://phabricator.wikimedia.org/T252512) (owner: 10Marostegui) [11:21:08] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:22:58] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 66, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:23:26] 10Operations, 
10Traffic: ats-backend throttles connections under heavy load - https://phabricator.wikimedia.org/T254714 (10jbond) p:05Triage→03Medium [11:23:50] XioNoX: ack, it seems that the cr2/pfw3 connection was lost and I bet the icinga check was going over that one. [11:24:41] yeah it shouldn't though as I was only working on cr1 [11:25:17] I've sent an email to observability to investigate the VO side [11:25:29] 10Operations, 10Phabricator, 10Security-Team, 10Traffic: Accessing Phabricator from Tor - https://phabricator.wikimedia.org/T254568 (10jbond) p:05Triage→03Medium [11:26:16] 10Operations, 10Security, 10User-MoritzMuehlenhoff, 10User-jbond: Ferm sometimes (rarely) fails to reload - https://phabricator.wikimedia.org/T254477 (10jbond) p:05Triage→03Medium [11:27:27] (03CR) 10Muehlenhoff: Extend Cassandra cookbook to also cover maps (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 (owner: 10Muehlenhoff) [11:28:08] (03PS3) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) [11:28:21] !log cr1-codfw add graceful-switchover - T243080 [11:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:39] (03CR) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [11:29:23] !log cr1-codfw add graceful-restart - T243080 [11:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:43] 10Operations, 10Security, 10User-MoritzMuehlenhoff, 10User-jbond: Ferm sometimes (rarely) fails to reload - https://phabricator.wikimedia.org/T254477 (10jbond) I think the problem here is that when `/usr/sbin/ferm -nl --domain ip /etc/ferm/ferm.conf` is run it sometimes fails to resolve DNS hosts. We coul... 
[11:29:45] (03PS6) 10Muehlenhoff: Drop maps from supported clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 [11:30:26] !log cr1-codfw re-enable transit/peering - T243080 [11:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:13] (03PS4) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) [11:32:04] !log cr1-codfw set OSPF metrics back to normal - T243080 [11:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:35] 10Operations, 10Security, 10User-MoritzMuehlenhoff, 10User-jbond: Ferm sometimes (rarely) fails to reload - https://phabricator.wikimedia.org/T254477 (10MoritzMuehlenhoff) Agreed, it's definitely related to failing DNS lookups, this happened more often until https://github.com/wikimedia/puppet/commit/5e8e6... [11:33:37] (03CR) 10Hnowlan: [C: 03+1] beta: Allow using docker volumes [puppet] - 10https://gerrit.wikimedia.org/r/601717 (https://phabricator.wikimedia.org/T251176) (owner: 10Alexandros Kosiaris) [11:33:49] 10Operations, 10Security, 10User-MoritzMuehlenhoff, 10User-jbond: Ferm sometimes (rarely) fails to reload - https://phabricator.wikimedia.org/T254477 (10jbond) might be better to resolve the DNS on the puppet master and only have IP's in the ferm config (no idea how much effort that would be though) [11:35:52] (03PS5) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) [11:36:57] (03CR) 10jerkins-bot: [V: 04-1] phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [11:38:01] !log fail vrrp master from cr2 to cr1 - T243080 [11:38:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:05] !log deactivate cr2-codfw transit/peering - T243080 [11:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:34] (03PS6) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) [11:41:41] !log de-pref cr2-codfw OSPF - T243080 [11:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:16] !log rolling restart of Apache on Kibana/7 host to pick up Gnu TLS security update [11:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:30] !log restarting slapd on ldap-corp* for Gnu TLS security update [11:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:32] (03CR) 10Dzahn: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [11:47:28] (03PS7) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) [11:49:32] !log reboot cr2-codfw:re1 (backup) - T243080 [11:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:07] !log Deploy schema change on s3 - T251188 [11:53:23] !log restarting dnsdist on malmok [11:53:44] !log cr2-codfw> request chassis routing-engine master switch - T243080 [11:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:55] T251188: ipb_address_unique has an extra column in production but not in the code (WAS: ipb_address_unique has an extra column in the code but not in production) - https://phabricator.wikimedia.org/T251188 [11:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:47] waiting for the linecards to boot up [11:56:03] (03PS8) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) [11:56:23] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:56:33] (03PS9) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) [11:57:17] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:57:29] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:57:38] (03CR) 10Dzahn: [C: 03+1] "works now: https://puppet-compiler.wmflabs.org/compiler1002/23059/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [11:58:33] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:58:47] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:58:49] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[12:01:32] (03CR) 10Dzahn: [C: 03+2] phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [12:04:40] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 88 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:05:49] (03PS7) 10Jbond: sre.pdus.rotate-password: split generic functions out to __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/598021 [12:05:50] !log reboot cr2-codfw:re0 (backup) - T243080 [12:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:58] !log cr2-codfw> request chassis routing-engine master switch - T243080 [12:09:59] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 58 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:10:00] last one! [12:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:15] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:13:17] * kart_ is updating cxserver. [12:14:19] XioNoX: Is it OK to deploy? [12:14:20] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:14:48] kart_: I'm done in 30s [12:14:55] OK! 
[12:15:04] checking that router came back as expected [12:15:41] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:16:05] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:16:55] kart_: yep all good! [12:17:03] thx for asking! [12:17:11] :) [12:17:14] (03PS1) 10Kormat: Add native mysql spicerack moodule. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 [12:17:23] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-06-08-045500-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/603141 (https://phabricator.wikimedia.org/T246319) (owner: 10KartikMistry) [12:17:55] (03Merged) 10jenkins-bot: Update cxserver to 2020-06-08-045500-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/603141 (https://phabricator.wikimedia.org/T246319) (owner: 10KartikMistry) [12:18:01] !log Compress InnoDB on db2094:3311 T254462 [12:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:05] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 [12:18:22] !log rollback cr2-codfw vrrp/ospf/bgp changes - T243080 [12:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:29] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 117 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:20:50] akosiaris: b3b13a6ae5ac4da5cfcafb65e28ca7b03b2a3069 seems there in deployment-chart, seems undeployed? [12:21:14] My mistake. Sorry. 
but, something for sure. [12:22:34] (03PS1) 10Ayounsi: Revert "Depool codfw for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/603436 [12:22:38] akosiaris: cxserver-0.0.19 - is it OK to deploy? [12:22:58] kart_: it's a bank holiday in Greece today [12:23:11] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool codfw for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/603436 (owner: 10Ayounsi) [12:23:20] Ouch. OK. I need to revert my merge then. [12:23:24] (03PS1) 10JMeybohm: lvs::configuration: add termbox-https [puppet] - 10https://gerrit.wikimedia.org/r/603437 (https://phabricator.wikimedia.org/T254581) [12:23:27] !log repool codfw - T243080 [12:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:43] (03PS1) 10KartikMistry: Revert "Update cxserver to 2020-06-08-045500-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/603438 [12:24:18] (03PS17) 10Kormat: install_server: Allow reuse of partitions during reimage. [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) [12:25:11] (03CR) 10KartikMistry: [C: 03+2] "Need to check unmerged changes in deployment-charts repository before deploy." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/603438 (owner: 10KartikMistry) [12:25:21] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:25:47] (03Merged) 10jenkins-bot: Revert "Update cxserver to 2020-06-08-045500-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/603438 (owner: 10KartikMistry) [12:27:58] (03CR) 10Kormat: "Dear reviewers," [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [12:29:24] 10Operations, 10Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (10Dzahn) >>! In T252932#6174068, @JGulingan wrote: > Just to clarify, IT does not manage donate@. Can you clarify if this points to Fundraising's zendesk email? Yea, so far it's managed by SRE. That's why... 
[12:30:01] 10Operations, 10Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (10Dzahn) a:03JGulingan [12:33:07] (03PS1) 10Cmjohnson: Add relforge100[34] to netboot cfg and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/603440 (https://phabricator.wikimedia.org/T241791) [12:33:56] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: TBD) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Cmjohnson) [12:34:23] (03CR) 10Cmjohnson: [C: 03+2] Add relforge100[34] to netboot cfg and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/603440 (https://phabricator.wikimedia.org/T241791) (owner: 10Cmjohnson) [12:34:26] (03PS1) 10Filippo Giunchedi: Fix Thanos compact alert threshold [puppet] - 10https://gerrit.wikimedia.org/r/603441 (https://phabricator.wikimedia.org/T252186) [12:35:51] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: TBD) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Cmjohnson) [12:36:25] (03CR) 10Dzahn: "tested both scripts.. noticed one does not work due to sender address being gone.. fixing but unrelated to this change." 
[puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [12:39:32] (03PS1) 10Dzahn: phabricator: change sender address of community_metrics mail [puppet] - 10https://gerrit.wikimedia.org/r/603445 [12:39:34] (03PS2) 10Filippo Giunchedi: Fix Thanos compact alert threshold [puppet] - 10https://gerrit.wikimedia.org/r/603441 (https://phabricator.wikimedia.org/T252186) [12:39:47] (03CR) 10Filippo Giunchedi: [C: 03+2] Fix Thanos compact alert threshold [puppet] - 10https://gerrit.wikimedia.org/r/603441 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:41:49] (03PS1) 10Vgutierrez: ATS: Add support on tls.lua for http requests [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [12:44:52] (03PS35) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [12:45:44] (03PS4) 10RhinosF1: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603365 (https://phabricator.wikimedia.org/T254012) [12:47:10] (03CR) 10jerkins-bot: [V: 04-1] cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [12:47:41] (03PS5) 10RhinosF1: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603365 (https://phabricator.wikimedia.org/T254012) [12:52:01] (03CR) 10Volans: [C: 03+1] "LGTM if we want to go towards this approach. I'll to the people more involved in the related services to decide the direction to go." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/602318 (owner: 10Muehlenhoff) [12:52:51] (03PS1) 10RhinosF1: Enable WikiLove on slwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603453 [12:53:46] volans: I think the pfw3 issue was a race condition [12:54:16] (03CR) 10Volans: [C: 03+1] "Small nit on the help message, looks good otherwise." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/598021 (owner: 10Jbond) [12:54:30] XioNoX: on the check side? [12:54:32] (03PS2) 10RhinosF1: Enable WikiLove on slwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603453 (https://phabricator.wikimedia.org/T254706) [12:55:24] the pfw facing port on cr1 probably came up a tad before the other ports, causing the pfw to try to route through it (because of MEDs) [12:55:37] (03CR) 10Elukey: [C: 03+2] matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey) [12:55:52] and because no other port up, traffic got blackholed [12:56:01] godog: volans: [12:56:10] sorry been so long since i did that :) [12:56:26] jbond42: 2 birds with one stone! 
:D [12:56:28] XioNoX: ack [12:56:52] (03PS36) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [12:58:44] (03CR) 10jerkins-bot: [V: 04-1] cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [13:01:48] (03CR) 10Muehlenhoff: matomo: move archive cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey) [13:03:14] (03PS1) 10Filippo Giunchedi: profile: allow NaN for Thanos compact/query errors [puppet] - 10https://gerrit.wikimedia.org/r/603457 (https://phabricator.wikimedia.org/T252186) [13:03:34] (03CR) 10jerkins-bot: [V: 04-1] profile: allow NaN for Thanos compact/query errors [puppet] - 10https://gerrit.wikimedia.org/r/603457 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:04:59] (03CR) 10Elukey: [C: 03+2] matomo: move archive cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey) [13:05:14] (03PS2) 10Filippo Giunchedi: profile: allow NaN for Thanos compact/query errors [puppet] - 10https://gerrit.wikimedia.org/r/603457 (https://phabricator.wikimedia.org/T252186) [13:05:16] (03PS6) 10Filippo Giunchedi: prometheus: merge ops instance role into profile [puppet] - 10https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186) [13:05:18] (03PS3) 10Filippo Giunchedi: prometheus: enable Thanos upload for k8s [puppet] - 10https://gerrit.wikimedia.org/r/602715 (https://phabricator.wikimedia.org/T252186) [13:05:21] (03PS3) 10Filippo Giunchedi: prometheus: enable Thanos upload for ops in esams [puppet] - 10https://gerrit.wikimedia.org/r/602717 (https://phabricator.wikimedia.org/T252186) [13:07:35] (03CR) 10Filippo Giunchedi: 
[C: 03+2] profile: allow NaN for Thanos compact/query errors [puppet] - 10https://gerrit.wikimedia.org/r/603457 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:09:32] (03PS2) 10Kormat: Add native mysql spicerack moodule. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 [13:10:46] (03PS1) 10Elukey: profile::piwik::instance: fix archiver's settings [puppet] - 10https://gerrit.wikimedia.org/r/603460 (https://phabricator.wikimedia.org/T252740) [13:13:38] RECOVERY - More than one Thanos compact running on icinga1001 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [13:13:43] (03CR) 10Elukey: [C: 03+2] profile::piwik::instance: fix archiver's settings [puppet] - 10https://gerrit.wikimedia.org/r/603460 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey) [13:15:10] I'll be roll-restarting prometheus 'ops' instance, no impact expected [13:18:24] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: merge ops instance role into profile [puppet] - 10https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:21:36] actually no that was a lie, no restart needed [13:30:26] (03CR) 10Volans: "Thanks for the patch! It's nice to see progress on this! 
I did a first pass, let's chat offline about the details and potential future exp" (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (owner: 10Kormat) [13:32:50] (03PS37) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [13:36:21] (03PS8) 10Jbond: sre.pdus.rotate-password: split generic functions out to __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/598021 [13:36:29] (03PS38) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [13:36:40] PROBLEM - PHP opcache health on mw2241 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:36:54] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10ema) 05Open→03Declined >>! In T242767#6199410, @MrJaroslavik wrote: > Hey, can be fixed this problem?... [13:41:35] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers [13:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:16] jumbo cluster --^ [13:45:00] (03CR) 10Jcrespo: "Please don't discuss fully offline, there are things that Riccardo won't know about our MySQL setup that I could help with." 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (owner: 10Kormat) [13:45:28] RECOVERY - PHP opcache health on mw2241 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:46:57] (03PS1) 10Elukey: sre.kafka.roll-restart-brokers: improve documentation readability [cookbooks] - 10https://gerrit.wikimedia.org/r/603473 [13:47:43] 10Operations, 10Patch-For-Review, 10SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (10CDanis) 05Open→03Resolved [13:49:00] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [13:49:52] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [13:50:44] (03CR) 10Jbond: [C: 03+1] "thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [13:50:57] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update mitigations for T250887 (duration: 00m 57s) [13:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:02] 10Operations, 10Patch-For-Review, 10SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (10Aklapper) @CDanis: Please feel free to {nav icon=anchor,name=Edit Related Tasks... > Close As Duplicate} in the upper right corner. Thanks! 
[13:53:11] (03PS2) 10Vgutierrez: ATS: Add support on tls.lua for http requests [puppet] - 10https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) [13:53:12] 10Operations, 10Patch-For-Review, 10SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (10Aklapper) [13:54:49] (03PS4) 10Jbond: puppet-merge: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602738 (https://phabricator.wikimedia.org/T254480) [13:54:53] 10Operations, 10Patch-For-Review, 10SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (10CDanis) I don't think they're strictly speaking duplicates; this task was for tracking the incident itself and writing the document in the first place; the ot... [13:56:59] (03CR) 10Jbond: [C: 03+2] puppet-merge: split dynamic values out of puppet-merge script [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [13:57:02] (03CR) 10Jbond: [C: 03+2] puppet-merge: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602738 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [13:58:34] (03PS1) 10Vgutierrez: mtail: Adjust ATS TTFB buckets [puppet] - 10https://gerrit.wikimedia.org/r/603475 [13:58:46] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10Ottomata) Hm, EventStreams uses the Server Sent Events for this very reason. I don't think anyone is exp... 
[13:58:49] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:58:50] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:57] (03PS1) 10Jbond: whiespace CR to check puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/603476 [14:00:54] (03CR) 10Jbond: [C: 03+2] whiespace CR to check puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/603476 (owner: 10Jbond) [14:00:56] !log updating puppet-merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/602738/4 [14:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:00] (03CR) 10Volans: [C: 03+1] "documentation only" [cookbooks] - 10https://gerrit.wikimedia.org/r/603473 (owner: 10Elukey) [14:03:25] (03CR) 10Elukey: [C: 03+2] sre.kafka.roll-restart-brokers: improve documentation readability [cookbooks] - 10https://gerrit.wikimedia.org/r/603473 (owner: 10Elukey) [14:05:48] 10Operations, 10Patch-For-Review, 10SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (10Aklapper) 05duplicate→03Resolved Ah. Sorry! 
[14:09:27] (03PS1) 10Jbond: puppet-merge: fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/603479 [14:10:24] (03CR) 10Jbond: [C: 03+2] puppet-merge: fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/603479 (owner: 10Jbond) [14:10:40] (03PS1) 10Jbond: Revert "whiespace CR to check puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/603480 [14:11:18] (03CR) 10Ema: [C: 03+1] mtail: Adjust ATS TTFB buckets [puppet] - 10https://gerrit.wikimedia.org/r/603475 (owner: 10Vgutierrez) [14:11:52] (03CR) 10Jbond: [C: 03+2] Revert "whiespace CR to check puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/603480 (owner: 10Jbond) [14:13:16] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/598021 (owner: 10Jbond) [14:14:06] (03PS1) 10Ladsgroup: Wrap WAN-cached PropertyInfoLookup with an APCu cache [extensions/Wikibase] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/603482 (https://phabricator.wikimedia.org/T254536) [14:14:15] (03CR) 10Hnowlan: [C: 03+2] changeprop: simplify config writing. make beta config write puppet-friendly YAML. [deployment-charts] - 10https://gerrit.wikimedia.org/r/598026 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [14:14:57] (03Merged) 10jenkins-bot: changeprop: simplify config writing. make beta config write puppet-friendly YAML. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/598026 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [14:15:17] (03PS1) 10Jbond: puppetmasters::scritps: join arrays [puppet] - 10https://gerrit.wikimedia.org/r/603483 [14:18:38] (03CR) 10CDanis: [C: 03+1] puppetmasters::scritps: join arrays [puppet] - 10https://gerrit.wikimedia.org/r/603483 (owner: 10Jbond) [14:18:49] jbond42: lmk if you need a hand [14:19:24] (03CR) 10Jbond: [C: 03+2] puppetmasters::scritps: join arrays [puppet] - 10https://gerrit.wikimedia.org/r/603483 (owner: 10Jbond) [14:19:46] cdanis: thanks i think i got it now, just a couple of silly issues that sliped through [14:20:41] that's the usual with that script, it'd be interesting to have a 'proper' testing environment for it, but that also sounds like a lot of work [14:22:14] yes and yes :) [14:22:17] (03PS2) 10Vgutierrez: mtail: Adjust ATS TTFB buckets [puppet] - 10https://gerrit.wikimedia.org/r/603475 (https://phabricator.wikimedia.org/T254714) [14:22:18] godog: volans: [14:22:28] bad day for it iapparently :( [14:23:40] puppet-merge all working again [14:23:58] (03CR) 10Vgutierrez: [C: 03+2] mtail: Adjust ATS TTFB buckets [puppet] - 10https://gerrit.wikimedia.org/r/603475 (https://phabricator.wikimedia.org/T254714) (owner: 10Vgutierrez) [14:25:15] (03PS1) 10Jbond: Revert "Revert "whiespace CR to check puppet-merge"" [puppet] - 10https://gerrit.wikimedia.org/r/603485 [14:26:06] (03CR) 10Volans: [C: 04-1] "one typo and one suggestion inline." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [14:26:09] (03CR) 10Filippo Giunchedi: [C: 04-1] "It looks like role::osm::common isn't used in production for katotherian etc, but rather role::maps (either ::master or ::slave) via profi" [puppet] - 10https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: 10Mholloway) [14:26:43] jbond42: haha! 
[14:27:37] (03PS1) 10CDanis: fix puppet merge typos [puppet] - 10https://gerrit.wikimedia.org/r/603487 [14:28:03] (03CR) 10Jbond: [C: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/603487 (owner: 10CDanis) [14:28:20] (03CR) 10CDanis: [C: 03+2] fix puppet merge typos [puppet] - 10https://gerrit.wikimedia.org/r/603487 (owner: 10CDanis) [14:28:23] (03CR) 10Jbond: [C: 03+2] fix puppet merge typos [puppet] - 10https://gerrit.wikimedia.org/r/603487 (owner: 10CDanis) [14:28:56] Deploying this now: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/603482 [14:29:07] (03CR) 10Ladsgroup: [C: 03+2] Wrap WAN-cached PropertyInfoLookup with an APCu cache [extensions/Wikibase] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/603482 (https://phabricator.wikimedia.org/T254536) (owner: 10Ladsgroup) [14:30:08] (03CR) 10Jbond: [C: 03+2] Revert "Revert "whiespace CR to check puppet-merge"" [puppet] - 10https://gerrit.wikimedia.org/r/603485 (owner: 10Jbond) [14:33:25] (03CR) 10Filippo Giunchedi: "Minor comments inline, LGTM overall" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:33:58] (03CR) 10CDanis: [C: 03+2] Systemd::Servicename: make it reflect reality e.g. php7.2-fpm [puppet] - 10https://gerrit.wikimedia.org/r/601460 (owner: 10CDanis) [14:34:30] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: add loki_event filter script [puppet] - 10https://gerrit.wikimedia.org/r/602729 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:35:49] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "see comments inline." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/603437 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [14:37:14] (03PS1) 10Dzahn: httpbb: convert an .erb.sh script to inline content [puppet] - 10https://gerrit.wikimedia.org/r/603490 (https://phabricator.wikimedia.org/T254480) [14:37:51] (03CR) 10Jbond: "Thanks, updated ill also preform some more testing tomorrow before this gets merged" (038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [14:38:45] (03PS39) 10Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - 10https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) [14:41:52] !log upgrading mw API servers in codfw to PHP 7.2.31 [14:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:02] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: enable partitioned jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602430 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:45:20] (03PS3) 10Kormat: Add native mysql spicerack module. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 [14:46:30] (03PS1) 10Dzahn: icinga: convert sync_icinga_state.sh.erb to file with config [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) [14:47:39] (03CR) 10jerkins-bot: [V: 04-1] icinga: convert sync_icinga_state.sh.erb to file with config [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [14:47:46] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕚☕ sudo cumin A:mw-canary 'disable-puppet "cdanis deploying I25ab44c1 T252605"' [14:47:48] (03CR) 10CDanis: [C: 03+2] textfile exporter for php-fpm worker pool status [puppet] - 10https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis) [14:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:50] T252605: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 [14:48:02] (03PS2) 10Dzahn: icinga: convert sync_icinga_state.sh.erb to file with config [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) [14:48:09] (03CR) 10Kormat: Add native mysql spicerack module. (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (owner: 10Kormat) [14:48:54] !log powering down ms-be2016 for BBU replacement [14:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:43] (03PS2) 10JMeybohm: lvs::configuration: add termbox-https [puppet] - 10https://gerrit.wikimedia.org/r/603437 (https://phabricator.wikimedia.org/T254581) [14:51:20] PROBLEM - Host ms-be2016 is DOWN: PING CRITICAL - Packet loss = 100% [14:51:31] (03PS5) 10EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 [14:52:15] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . 
[14:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:08] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕚☕ sudo cumin A:mw-canary 'enable-puppet "cdanis deploying I25ab44c1 T252605"' [14:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:12] T252605: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 [14:54:05] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10jbond) @Jclark-ctr ping: Are you able to respond to the comments and questions from Daniel above, thanks [14:54:38] (03CR) 10Dzahn: [C: 04-1] "Could not find any files from role/icinga/sync_icinga_state.sh" [puppet] - 10https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [14:55:17] (03PS2) 10Hnowlan: changeprop-jobqueue: enable partitioned jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602430 (https://phabricator.wikimedia.org/T220399) [14:57:55] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] changeprop-jobqueue: enable partitioned jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602430 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:58:13] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Jclark-ctr) @jbond have recently had issues with computer have reached out to IT will be reimaged [14:58:23] (03Merged) 10jenkins-bot: changeprop-jobqueue: enable partitioned jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/602430 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [15:01:42] 10Operations, 10observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (10herron) 05Open→03Resolved a:03herron [15:02:54] (03Merged) 10jenkins-bot: Wrap WAN-cached PropertyInfoLookup with an 
APCu cache [extensions/Wikibase] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/603482 (https://phabricator.wikimedia.org/T254536) (owner: 10Ladsgroup) [15:05:52] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [15:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:57] Deployed on mwdebug1001 and works fine [15:09:31] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.35/extensions/Wikibase/lib/includes/Store/CachingPropertyInfoLookup.php: Wrap WAN-cached PropertyInfoLookup with an APCu cache, Part I out of III (T254536) (duration: 00m 59s) [15:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:35] T254536: CacheAwarePropertyInfoStore performs 4000 Memc ops/s (APC not working?) - https://phabricator.wikimedia.org/T254536 [15:10:14] RECOVERY - Host ms-be2016 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [15:10:54] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.35/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: Wrap WAN-cached PropertyInfoLookup with an APCu cache, Part II out of III (T254536) (duration: 00m 57s) [15:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:02] 10Operations, 10ops-codfw: BBU faulty on ms-be2016 - https://phabricator.wikimedia.org/T252851 (10Papaul) 05Open→03Resolved BBU replacement complete [15:11:15] (03PS6) 10EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 [15:11:50] PROBLEM - PHP opcache health on mw2199 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:12:35] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.35/extensions/Wikibase/client/includes/Store/Sql/DirectSqlStore.php: Wrap WAN-cached PropertyInfoLookup with an APCu cache, Part III out of III (T254536) 
(duration: 00m 57s) [15:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:29] _joe_: It's deployed now [15:13:43] (03PS2) 10MSantos: maps: profile::rsyslog::udp_localhost_compat [puppet] - 10https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: 10Mholloway) [15:14:00] 10Operations, 10observability, 10Patch-For-Review: StatsD Exporter drops relayed metrics - https://phabricator.wikimedia.org/T239833 (10colewhite) 05Open→03Declined Still a problem, but probably not big enough to warrant the effort. [15:15:18] 10Operations, 10observability: mtail rc35 stops incrementing atsmtail counters - https://phabricator.wikimedia.org/T254192 (10colewhite) 05Open→03Resolved This issue hasn't resurfaced since disabling fsnotify. Moving forward with the upgrade. [15:16:12] RECOVERY - HP RAID on ms-be2016 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:19:10] PROBLEM - PHP opcache health on mw2139 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:19:26] PROBLEM - PHP opcache health on mw2138 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:20:50] PROBLEM - PHP opcache health on mw2136 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:20:53] <_joe_> Amir1: we had a drop in memcached requests around the time of the deploy :)) [15:20:57] https://grafana.wikimedia.org/d/000000316/memcache?panelId=21&fullscreen&orgId=1&from=1591627189995&to=1591629557781 
[15:20:58] PROBLEM - PHP opcache health on mw2135 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:21:08] (03CR) 10Mholloway: maps: profile::rsyslog::udp_localhost_compat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: 10Mholloway) [15:21:20] PROBLEM - PHP opcache health on mw2137 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:21:23] is this caused by my patch ^ [15:21:50] this is codfw [15:22:20] _joe_: \o/ my estimation is reduction of 25K reqs/s = 5% total requests [15:22:26] PROBLEM - PHP opcache health on mw2144 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:22:38] RECOVERY - PHP opcache health on mw2199 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:22:43] (03PS1) 10CDanis: expand phpfpm status text exporter to all appservers [puppet] - 10https://gerrit.wikimedia.org/r/603511 (https://phabricator.wikimedia.org/T252605) [15:22:46] PROBLEM - PHP opcache health on mw2140 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:22:47] <_joe_> don't worry about codfw, sigh [15:22:48] PROBLEM - PHP opcache health on mw2147 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:23:02] PROBLEM - PHP opcache health on mw2146 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:23:06] PROBLEM - PHP opcache health on mw2142 is 
CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:23:10] PROBLEM - PHP opcache health on mw2145 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:23:22] PROBLEM - PHP opcache health on mw2143 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:23:49] (03PS1) 10Ppchelko: Disable HTCP purges for test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781) [15:24:41] !log hnowlan@deploy1001 Started deploy [cpjobqueue/deploy@07d8c32]: Disabling jobs migrated to k8s [15:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:48] (03CR) 10jerkins-bot: [V: 04-1] Disable HTCP purges for test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [15:25:25] afk for lunch [15:26:35] 10Puppet, 10Beta-Cluster-Infrastructure: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (10Mholloway) This happened again over the weekend. I've restarted it. 
[15:27:00] (03PS2) 10Ppchelko: Disable HTCP purges for test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781) [15:27:03] <_joe_> Amir1: didn't know you moved to the US :D [15:27:09] <_joe_> or well south america [15:28:12] !log jynus@cumin2001 dbctl commit (dc=all): 'depool db2075 for mw maintenance T254139', diff saved to https://phabricator.wikimedia.org/P11411 and previous config saved to /var/cache/conftool/dbconfig/20200608-152811-jynus.json [15:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:16] T254139: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 [15:29:16] !log hnowlan@deploy1001 Finished deploy [cpjobqueue/deploy@07d8c32]: Disabling jobs migrated to k8s (duration: 04m 34s) [15:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:40] (03CR) 10CDanis: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/23066/" [puppet] - 10https://gerrit.wikimedia.org/r/603511 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis) [15:29:41] !log Migrated all cpjobqueue jobs from scb to Kubernetes [15:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:02] RECOVERY - PHP opcache health on mw2140 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:30:25] _joe_: lol, Canada :P [15:34:22] PROBLEM - PHP opcache health on wtp2008 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:37:35] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10Papaul) @hnowlan Hello any reason why this is still open? 
Thanks [15:40:01] (03PS1) 10Elukey: role::swap: remove access to analytics users [puppet] - 10https://gerrit.wikimedia.org/r/603522 (https://phabricator.wikimedia.org/T249752) [15:40:43] PROBLEM - PHP opcache health on mw2203 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:41:00] PROBLEM - PHP opcache health on mw2209 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:41:02] (03CR) 10Elukey: [C: 03+2] role::swap: remove access to analytics users [puppet] - 10https://gerrit.wikimedia.org/r/603522 (https://phabricator.wikimedia.org/T249752) (owner: 10Elukey) [15:41:02] PROBLEM - PHP opcache health on mw2204 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:41:04] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10hnowlan) 05Open→03Resolved [15:41:06] RECOVERY - PHP opcache health on mw2138 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:41:12] RECOVERY - PHP opcache health on mw2145 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:41:13] RECOVERY - PHP opcache health on mw2137 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:41:18] PROBLEM - PHP opcache health on mw2206 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:41:58] PROBLEM - PHP opcache health on mw2208 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:42:12] PROBLEM - PHP opcache health on mw2201 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:42:22] PROBLEM - PHP opcache health on mw2202 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:42:26] PROBLEM - PHP opcache health on mw2200 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:42:40] RECOVERY - PHP opcache health on mw2135 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:06] RECOVERY - PHP opcache health on mw2144 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:12] PROBLEM - PHP opcache health on mw2211 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:26] PROBLEM - PHP opcache health on mw2207 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:28] RECOVERY - PHP opcache health on mw2147 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:28] PROBLEM - PHP opcache health on mw2218 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:40] PROBLEM - PHP opcache health on mw2216 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:42] RECOVERY 
- PHP opcache health on mw2146 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:44] PROBLEM - PHP opcache health on mw2210 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:44] PROBLEM - PHP opcache health on mw2214 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:52] PROBLEM - PHP opcache health on mw2215 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:44:54] PROBLEM - PHP opcache health on mw2212 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:45:03] PROBLEM - PHP opcache health on mw2217 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:45:10] PROBLEM - PHP opcache health on mw2305 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:45:32] PROBLEM - PHP opcache health on mw2219 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:45:52] PROBLEM - PHP opcache health on mw2307 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:46:06] PROBLEM - PHP opcache health on mw2304 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:46:08] PROBLEM - PHP opcache health on mw2309 is 
CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:46:11] (03PS1) 10Ammarpad: Remove Mobile mainpage special casing from it and vec wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603524 (https://phabricator.wikimedia.org/T254731) [15:46:34] PROBLEM - PHP opcache health on mw2301 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:46:34] RECOVERY - PHP opcache health on mw2142 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:46:45] (03PS1) 10Elukey: profile::swap: skip deployment of mysql-credentials [puppet] - 10https://gerrit.wikimedia.org/r/603525 (https://phabricator.wikimedia.org/T249752) [15:46:58] RECOVERY - PHP opcache health on wtp2008 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:47:00] PROBLEM - PHP opcache health on mw2306 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:47:42] RECOVERY - PHP opcache health on mw2307 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:47:45] (03CR) 10Elukey: [C: 03+2] profile::swap: skip deployment of mysql-credentials [puppet] - 10https://gerrit.wikimedia.org/r/603525 (https://phabricator.wikimedia.org/T249752) (owner: 10Elukey) [15:48:26] PROBLEM - PHP opcache health on mw2303 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:48:33] (03CR) 10DannyS712: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603524 (https://phabricator.wikimedia.org/T254731) (owner: 10Ammarpad) 
[15:49:50] RECOVERY - PHP opcache health on mw2207 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:49:52] RECOVERY - PHP opcache health on mw2139 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:50:02] RECOVERY - PHP opcache health on mw2209 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:52:02] RECOVERY - PHP opcache health on mw2303 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:53:40] PROBLEM - PHP opcache health on mw2283 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:53:54] PROBLEM - PHP opcache health on mw2288 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:53:56] PROBLEM - PHP opcache health on mw2289 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:54:06] PROBLEM - PHP opcache health on mw2285 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:54:12] PROBLEM - PHP opcache health on mw2286 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:54:18] PROBLEM - PHP opcache health on mw2284 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:54:56] PROBLEM - PHP opcache health on mw2287 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:57:32] PROBLEM - PHP opcache health on mw2332 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:57:56] RECOVERY - PHP opcache health on mw2284 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:57:56] PROBLEM - PHP opcache health on mw2333 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:58:02] 👀 [15:58:14] PROBLEM - PHP opcache health on mw2298 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:58:14] PROBLEM - PHP opcache health on mw2331 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:58:32] PROBLEM - PHP opcache health on mw2299 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:58:34] PROBLEM - PHP opcache health on mw2257 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:58:48] PROBLEM - PHP opcache health on mw2254 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:59:08] PROBLEM - PHP opcache health on mw2290 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:59:08] PROBLEM - PHP opcache health on mw2256 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:59:20] PROBLEM - PHP opcache health on mw2293 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:59:20] PROBLEM - PHP opcache health on mw2291 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:59:22] PROBLEM - PHP opcache health on mw2255 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:00:04] PROBLEM - Check systemd state on mw2244 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:16] RECOVERY - PHP opcache health on mw2201 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:00:18] PROBLEM - PHP opcache health on mw2244 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:00:22] PROBLEM - PHP opcache health on mw2227 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:00:56] RECOVERY - PHP opcache health on mw2216 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:00:56] PROBLEM - PHP opcache health on mw2228 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:01:04] PROBLEM - PHP opcache health on mw2225 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:01:34] RECOVERY - PHP opcache health on mw2333 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:01:48] RECOVERY - PHP opcache health on mw2208 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:01:54] PROBLEM - PHP opcache health on mw2229 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:01:58] PROBLEM - PHP opcache health on mw2297 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:01:58] PROBLEM - PHP opcache health on mw2226 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:02:03] (03CR) 10Muehlenhoff: [C: 03+1] "Great work. If you want to test this on more baremetal servers, feel free to use the sretest* systems for this as well." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: 10Kormat) [16:02:36] (03PS1) 10Ppchelko: Beta: Switch from HTCP purging to kafka purging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603530 (https://phabricator.wikimedia.org/T250781) [16:02:44] RECOVERY - PHP opcache health on mw2204 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:02:58] RECOVERY - PHP opcache health on mw2206 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:03:27] (03CR) 10RLazarus: [C: 03+1] "No strong feelings, LGTM if you prefer it this way. Thanks for reminding me that I need to get around to debianizing this." 
[puppet] - 10https://gerrit.wikimedia.org/r/603490 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) [16:03:38] RECOVERY - PHP opcache health on mw2219 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:04:06] RECOVERY - PHP opcache health on mw2200 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:04:13] RECOVERY - PHP opcache health on mw2309 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:04:13] RECOVERY - PHP opcache health on mw2203 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:04:38] RECOVERY - PHP opcache health on mw2301 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:04:38] RECOVERY - PHP opcache health on mw2214 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:04:46] RECOVERY - PHP opcache health on mw2332 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:04:58] RECOVERY - PHP opcache health on mw2217 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:05:01] (03PS5) 10Urbanecm: Enable subpages in Page namespace on napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: 10Zoranzoki21) [16:05:04] RECOVERY - PHP opcache health on mw2305 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:05:05] (03CR) 10Dzahn: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/603490 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn) 
[16:05:29] 10Operations, 10Performance-Team, 10serviceops, 10Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10Krinkle) (//Moving to team inbox for next meeting.//) [16:05:33] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: 10Zoranzoki21) [16:05:40] PROBLEM - PHP opcache health on mw2367 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:05:54] RECOVERY - PHP opcache health on mw2211 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:05:56] PROBLEM - PHP opcache health on mw2361 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:06:10] PROBLEM - PHP opcache health on mw2364 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:06:22] PROBLEM - PHP opcache health on mw2369 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:06:34] RECOVERY - PHP opcache health on mw2215 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:06:36] RECOVERY - PHP opcache health on mw2255 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:07:06] PROBLEM - PHP opcache health on mw2368 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:07:16] PROBLEM - PHP opcache health on mw2365 is CRITICAL: 
CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:07:20] PROBLEM - PHP opcache health on mw2363 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:07:20] RECOVERY - PHP opcache health on mw2229 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:07:24] PROBLEM - PHP opcache health on mw2366 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:07:36] RECOVERY - PHP opcache health on mw2299 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:07:40] PROBLEM - PHP opcache health on mw2198 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:08:02] (03CR) 10Eevans: [C: 03+1] Drop maps from supported clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 (owner: 10Muehlenhoff) [16:08:26] RECOVERY - PHP opcache health on mw2212 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:08:48] PROBLEM - PHP opcache health on mw2253 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:09:04] PROBLEM - PHP opcache health on mw2252 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:09:32] PROBLEM - PHP opcache health on mw2251 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:10:06] RECOVERY 
- PHP opcache health on mw2225 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:10:20] 10Operations, 10Analytics, 10Traffic: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (10Milimetric) p:05Medium→03High [16:10:28] RECOVERY - PHP opcache health on mw2286 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:10:54] RECOVERY - PHP opcache health on mw2331 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:11:12] RECOVERY - PHP opcache health on mw2227 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:11:12] RECOVERY - PHP opcache health on mw2257 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:11:42] PROBLEM - PHP opcache health on mw2221 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:11:46] RECOVERY - PHP opcache health on mw2283 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:11:48] PROBLEM - PHP opcache health on mw2220 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:11:56] PROBLEM - PHP opcache health on mw2359 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:12:24] PROBLEM - PHP opcache health on mw2355 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:12:32] PROBLEM - PHP 
opcache health on mw2350 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:12:40] PROBLEM - PHP opcache health on mw2358 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:12:40] RECOVERY - PHP opcache health on mw2252 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:13:02] PROBLEM - PHP opcache health on mw2222 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:13:02] PROBLEM - PHP opcache health on mw2357 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:13:06] RECOVERY - PHP opcache health on mw2202 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:13:16] PROBLEM - PHP opcache health on mw2351 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:13:16] PROBLEM - PHP opcache health on mw2353 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:13:24] PROBLEM - PHP opcache health on mw2223 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:13:36] RECOVERY - PHP opcache health on mw2228 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:13:48] RECOVERY - PHP opcache health on mw2288 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:14:00] RECOVERY - PHP opcache health on mw2285 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:14:14] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10Dzahn) What's up with the certificate renewal issue ? (15 days left). Does it need a separate task? [16:14:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, ran tests with https://wikitech.wikimedia.org/wiki/Thumbor#Local_development" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/603386 (https://phabricator.wikimedia.org/T254557) (owner: 10Gilles) [16:15:06] (03CR) 10EBernhardson: [C: 03+1] "should be deployable in any SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595019 (owner: 10Mstyles) [16:15:20] RECOVERY - PHP opcache health on mw2221 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:15:30] RECOVERY - PHP opcache health on mw2210 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:15:32] PROBLEM - PHP opcache health on mw2371 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:15:52] RECOVERY - PHP opcache health on mw2306 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:16:00] RECOVERY - PHP opcache health on mw2355 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:16:26] PROBLEM - PHP opcache health on mw2376 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:16:26] PROBLEM - PHP opcache health 
on mw2374 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:16:26] RECOVERY - PHP opcache health on mw2226 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:16:36] PROBLEM - PHP opcache health on mw2372 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:17:20] PROBLEM - PHP opcache health on mw2375 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:17:26] RECOVERY - PHP opcache health on mw2293 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:18:08] RECOVERY - PHP opcache health on mw2298 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:18:16] RECOVERY - PHP opcache health on mw2297 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:19:03] RECOVERY - PHP opcache health on mw2290 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:19:16] RECOVERY - PHP opcache health on mw2291 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:20:38] RECOVERY - PHP opcache health on mw2218 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:21:04] RECOVERY - PHP opcache health on mw2289 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:22:24] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on 
logstash1011 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [16:23:38] RECOVERY - PHP opcache health on mw2366 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:23:48] RECOVERY - PHP opcache health on mw2244 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:24:06] RECOVERY - PHP opcache health on mw2304 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:24:16] RECOVERY - PHP opcache health on mw2364 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:25:10] RECOVERY - PHP opcache health on mw2368 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:25:18] RECOVERY - PHP opcache health on mw2365 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:25:23] RECOVERY - PHP opcache health on mw2363 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:25:28] RECOVERY - PHP opcache health on mw2376 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:25:56] RECOVERY - PHP opcache health on mw2254 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:26:16] RECOVERY - PHP opcache health on mw2369 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:27:00] (03PS1) 10Hnowlan: 
changeprop: remove changeprop from puppet [puppet] - 10https://gerrit.wikimedia.org/r/603534 (https://phabricator.wikimedia.org/T220399) [16:27:20] RECOVERY - PHP opcache health on mw2367 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:27:38] RECOVERY - PHP opcache health on mw2361 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:29:24] RECOVERY - PHP opcache health on mw2251 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:29:40] RECOVERY - PHP opcache health on mw2223 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:29:53] RECOVERY - PHP opcache health on mw2220 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:31:06] RECOVERY - PHP opcache health on mw2357 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:31:18] RECOVERY - PHP opcache health on mw2351 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:31:20] RECOVERY - PHP opcache health on mw2136 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:31:42] RECOVERY - PHP opcache health on mw2256 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:31:50] RECOVERY - PHP opcache health on mw2359 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:32:26] RECOVERY - PHP opcache health on mw2350 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:32:52] RECOVERY - PHP 
opcache health on mw2222 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:33:08] RECOVERY - PHP opcache health on mw2353 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:33:36] RECOVERY - PHP opcache health on mw2371 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:33:50] RECOVERY - PHP opcache health on mw2143 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:34:04] RECOVERY - PHP opcache health on mw2253 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:34:40] RECOVERY - PHP opcache health on mw2372 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:38:17] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10hnowlan) Yeah I think so - there are multiple hosts affected by this issue. Tracking in T254784 [16:39:32] PROBLEM - PHP opcache health on mw2319 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:39:47] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) I ran the mtr tests earlier in this bug report. Running a curl gives ` {"errors":[{"code":"empty-file","html":"The file you submit... 
[16:40:46] RECOVERY - PHP opcache health on mw2375 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:42:08] (03CR) 10MSantos: maps: profile::rsyslog::udp_localhost_compat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: 10Mholloway) [16:43:00] (03PS3) 10MSantos: maps: profile::rsyslog::udp_localhost_compat [puppet] - 10https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: 10Mholloway) [16:43:22] RECOVERY - PHP opcache health on mw2358 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:44:15] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10stjn) This is a very strange conclusion to this task. There was never an assumption that you do not need... 
[16:44:56] !log testing upcoming Scap release on beta [16:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:08] (03CR) 10Mholloway: [C: 03+1] maps: profile::rsyslog::udp_localhost_compat [puppet] - 10https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: 10Mholloway) [16:45:34] PROBLEM - PHP opcache health on wtp2018 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:47:06] RECOVERY - PHP opcache health on mw2374 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:47:18] RECOVERY - PHP opcache health on mw2287 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:47:22] RECOVERY - PHP opcache health on mw2198 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:54:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) [16:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:14] \o/ [16:55:50] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker [16:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:56] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:57:50] 10Operations, 10ops-codfw, 10DBA: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (10Papaul) Before ` BIOS Version 2.4.3 Firmware Version 2.40.40.40 IP Address(es) 10.193.1.55 iDRAC MAC Address 84:7B:EB:F6:97:56 DNS Domai... 
[16:58:18] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:58:43] (03CR) 10Ladsgroup: "I will deploy this tomorrow if there's no objection." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602675 (https://phabricator.wikimedia.org/T111853) (owner: 10Ladsgroup) [16:58:44] 10Operations, 10ops-codfw, 10DBA: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (10Papaul) 05Open→03Resolved @jcrespo firmware upgrade complete [16:58:47] 10Operations, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Papaul) [16:59:16] RECOVERY - PHP opcache health on mw2319 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:00:04] gehel and onimisionipe: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T1700) [17:00:06] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10BBlack) >>! In T242767#6201754, @Ottomata wrote: [reordering a little] > What happens right now if someon... [17:01:09] Krinkle: let me know when you want to backport the change [17:01:50] Amir1: link? 
[17:02:02] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/603536 [17:02:29] (03PS1) 10Krinkle: mediawiki.misc-authed-curate: Check for 'showrollbackconfirmation' preference [core] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/603544 (https://phabricator.wikimedia.org/T254538) [17:02:32] Amir1: sure :) [17:02:50] oh neat, the wikibugs changed was rolled out [17:03:07] (03CR) 10Ladsgroup: [C: 03+2] "UBN" [core] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/603544 (https://phabricator.wikimedia.org/T254538) (owner: 10Krinkle) [17:04:10] PROBLEM - PHP opcache health on mw2270 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:04:40] what wikibugs change? [17:06:15] (03PS1) 10Dave Pifke: Use PDO for XHGui storage if configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) [17:06:32] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [17:06:44] RECOVERY - Check systemd state on mw2244 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:51] (03CR) 10jerkins-bot: [V: 04-1] Use PDO for XHGui storage if configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:11:45] (03PS1) 10Catrope: GrowthExperiments: End A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603548 (https://phabricator.wikimedia.org/T254413) [17:13:10] RECOVERY - PHP opcache health on mw2270 is OK: OK: opcache is healthy 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:13:22] (03CR) 10Cwhite: [C: 03+2] profile: add loki_event filter script [puppet] - 10https://gerrit.wikimedia.org/r/602729 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [17:14:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) [17:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:10] (03PS1) 10Dave Pifke: [WIP] webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) [17:18:43] (03CR) 10jerkins-bot: [V: 04-1] [WIP] webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:19:17] (03PS10) 10Cwhite: profile: add loki output support to the logstash pipeline [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) [17:19:54] (03CR) 10Cwhite: profile: add loki output support to the logstash pipeline (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [17:21:12] (03CR) 10jerkins-bot: [V: 04-1] profile: add loki output support to the logstash pipeline [puppet] - 10https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [17:21:12] PROBLEM - PHP opcache health on mw2275 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:21:32] RECOVERY - PHP opcache health on wtp2018 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:23:29] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - 
https://phabricator.wikimedia.org/T242767 (10Ottomata) (Thanks for the response bblack!) > 2. Does the typical client handle the disconnect gracefull... [17:24:34] (03CR) 10RLazarus: [C: 03+1] expand phpfpm status text exporter to all appservers [puppet] - 10https://gerrit.wikimedia.org/r/603511 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis) [17:25:08] (03CR) 10CDanis: [C: 03+2] expand phpfpm status text exporter to all appservers [puppet] - 10https://gerrit.wikimedia.org/r/603511 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis) [17:28:17] (03PS1) 10Urbanecm: Do not require opt-in for guidance in beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603552 (https://phabricator.wikimedia.org/T254789) [17:30:19] (03PS2) 10Dave Pifke: [WIP] webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) [17:31:19] (03Merged) 10jenkins-bot: mediawiki.misc-authed-curate: Check for 'showrollbackconfirmation' preference [core] (wmf/1.35.0-wmf.35) - 10https://gerrit.wikimedia.org/r/603544 (https://phabricator.wikimedia.org/T254538) (owner: 10Krinkle) [17:31:36] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: ASAP) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10Cmjohnson) [17:35:38] RECOVERY - PHP opcache health on mw2275 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:37:45] (03PS2) 10Mstyles: Update ML models [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595019 (https://phabricator.wikimedia.org/T219534) [17:42:55] 10Operations, 10ops-codfw, 10DBA: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (10jcrespo) Thank you for the help, putting the services back up. 
[17:43:07] (03PS1) 10Cmjohnson: Adding thanos-fe100[123] to dhcpd file and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/603554 (https://phabricator.wikimedia.org/T251620) [17:43:16] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.35/resources/src/mediawiki.misc-authed-curate/rollback.js: Fix: Diff pages show rollback confirmation prompt if there is the "Mark as patrolled" link (T254538) (duration: 00m 59s) [17:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:20] T254538: Diff pages show rollback confirmation prompt if there is the "Mark as patrolled" link - https://phabricator.wikimedia.org/T254538 [17:49:27] 10Operations, 10ops-codfw, 10DBA: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (10jcrespo) @marostegui there seems to be a bug on 10.1.45-MariaDB installed locally, as the systemd unit doesn't notify the start (despite actually g... [17:50:08] !log restart prometheus burrow exporter for kafka main on kafkamon1001 - T254498 [17:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:12] T254498: reset of burrow metrics for consumer group - https://phabricator.wikimedia.org/T254498 [17:51:30] Pchelolo: --^ [17:51:41] (03PS7) 10Privacybatm: Write documentation using Sphinx [software/transferpy] - 10https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219) [17:52:20] elukey: thank you! doesn't reflect on the graphs yet [17:52:40] (03CR) 10Cmjohnson: [C: 03+2] Adding thanos-fe100[123] to dhcpd file and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/603554 (https://phabricator.wikimedia.org/T251620) (owner: 10Cmjohnson) [17:52:49] weird I still see cpjobqueue-low_traffic_jobs listed with details [17:52:57] in burrow I mean [17:53:38] Pchelolo: sure that the cgroup is not active anymore? 
[17:53:44] cpjobqueue-low_traffic_jobs [17:54:03] it should be still active, but for a different subset of topics [17:54:18] https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=cpjobqueue-low_traffic_jobs [17:54:24] this seems ok though --^ [17:54:28] the consumer group is valid, we just subscribed it to a wrong set of topics [17:54:49] the graph seems updated no? [17:55:03] or does it show old topics? [17:55:13] yup, looks good to me [17:55:16] thank you elukey! [17:55:19] super :) [17:57:07] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Zayo trouble tickets TTN-0004144337, TTN-0004143746, and TTN-0004144096, for some reason. - The acknowledgement expires at: 2020-06-09 17:56:35. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:57:07] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Zayo trouble tickets TTN-0004144337, TTN-0004143746, and TTN-0004144096, for some reason. - The acknowledgement expires at: 2020-06-09 17:56:35. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T1800). [18:00:04] Pchelolo and RoanKattouw: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:18] I can do the SWAT [18:00:58] cool, thank you RoanKattouw. Would you do mine or yours first? 
[18:01:02] the what now? ;) [18:01:12] Oh ey [18:01:16] Good rename, I like it [18:01:23] (I’m hanging around partly out of curiosity for the new log messages and the like) [18:02:35] OK this was only announced an hour ago, so I don't have to feel bad about missing the announcement email :) [18:02:39] ^^ [18:02:52] yeah maybe I should’ve been less cryptic sorry [18:02:58] (03CR) 10Krinkle: [C: 04-1] "Only use wg* for overriding core config keys. For things local to wmf-config, use wmg*." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:03:30] Pchelolo: Mind if I amend your patch to ---- yes that --^^ [18:03:35] yup [18:03:47] oh, I mean, 'I don't mind' [18:05:24] (03PS3) 10Catrope: Disable HTCP purges for test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:05:58] (03CR) 10Catrope: [C: 03+2] Disable HTCP purges for test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:06:38] RoanKattouw: I would need some time for testing it on mwdebug as well please [18:06:56] (03Merged) 10jenkins-bot: Disable HTCP purges for test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:07:51] Pchelolo: It's there, test away [18:07:56] thank you [18:10:07] RoanKattouw: mwdebug1001 or 1002? [18:10:12] 1002 sorry [18:11:16] RoanKattouw: All good! thank you [18:11:49] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10bd808) >>! In T254491#6200618, @hashar wrote: > The incident page 20200605-cloud-private-repo has the date the page has been cr... 
[18:15:28] (03PS2) 10Dave Pifke: Use PDO for XHGui storage if configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) [18:16:27] (03CR) 10Jforrester: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/598058 (https://phabricator.wikimedia.org/T253263) (owner: 10Hashar) [18:19:36] OK, deploying [18:19:41] 10Operations, 10ops-codfw, 10DBA: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (10Marostegui) Yeah, I was testing the new version on that host with the new package and then I got into lots of others things. If you have some time... [18:20:29] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable HTCP purges for testwiki (T250781) (part 1) (duration: 00m 59s) [18:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:33] T250781: Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781 [18:23:13] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Disable HTCP purges for testwiki (T250781) (part 2) (duration: 00m 56s) [18:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:04] (03PS2) 10Catrope: GrowthExperiments: End A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603548 (https://phabricator.wikimedia.org/T254413) [18:24:10] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: End A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603548 (https://phabricator.wikimedia.org/T254413) (owner: 10Catrope) [18:25:08] (03PS2) 10Urbanecm: GrowthExperiments: Do not require opt-in for guidance in beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603552 (https://phabricator.wikimedia.org/T254789) [18:25:22] (03Merged) 10jenkins-bot: GrowthExperiments: End A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603548 (https://phabricator.wikimedia.org/T254413) 
(owner: 10Catrope) [18:28:01] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: ASAP) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-fe1003.eqiad.wmnet ` The log can be found in... [18:28:19] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: End GrowthExperiments homepage A/B test (T254413) (duration: 00m 57s) [18:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:22] T254413: Variant tests: switch all newcomers to Variant A - https://phabricator.wikimedia.org/T254413 [18:28:56] RoanKattouw: if you could review my GE patch too, it would be cool [18:29:12] (also, ping me when done, would like to do a couple of things) [18:29:27] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: Do not require opt-in for guidance in beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603552 (https://phabricator.wikimedia.org/T254789) (owner: 10Urbanecm) [18:29:39] Thanks for finding and fixing that! [18:29:45] no problem! [18:30:09] Beta updates its config automatically every 10 minutes, so you'll probably have to wait a little bit [18:30:13] (03Merged) 10jenkins-bot: GrowthExperiments: Do not require opt-in for guidance in beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603552 (https://phabricator.wikimedia.org/T254789) (owner: 10Urbanecm) [18:30:29] wasn't that in postmerge? [18:31:10] I think the beta deployment host getting the new config patch is in postmerge, but I don't think the full deployment (beta-scap-eqiad) is [18:31:19] gotcha [18:35:25] RoanKattouw: are you still deploying? 
[18:35:33] No, I'm done [18:39:27] okay, thx [18:39:53] (03PS6) 10Urbanecm: Enable subpages in Page namespace on napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: 10Zoranzoki21) [18:40:00] (03CR) 10Urbanecm: [C: 03+2] Enable subpages in Page namespace on napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: 10Zoranzoki21) [18:40:23] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: ASAP) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-fe1003.eqiad.wmnet'] ` Of which those **FAILED**: ` ['thanos-fe1003.eqiad.wmnet'] ` [18:42:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:35] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:17] (03CR) 10Urbanecm: [C: 03+2] Enable subpages in Page namespace on napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: 10Zoranzoki21) [18:48:08] (03Merged) 10jenkins-bot: Enable subpages in Page namespace on napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: 10Zoranzoki21) [18:49:11] (03PS10) 10Urbanecm: IS.php: Add wgProofreadPagePageJoiner, set it per default on '-' and at zhwikisource on __PAGEJOIN__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) (owner: 10Zoranzoki21) [18:51:10] (03PS11) 10Urbanecm: IS.php: Add wgProofreadPagePageJoiner, set it per default on '-' and at zhwikisource on __PAGEJOIN__ [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) (owner: 10Zoranzoki21) [18:51:28] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0e85203: Enable subpages in Page namespace on napwikisource (T252755) (duration: 00m 58s) [18:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:32] T252755: Add subpages in ns Page for nap.source - https://phabricator.wikimedia.org/T252755 [18:52:33] (03CR) 10Urbanecm: [C: 03+2] IS.php: Add wgProofreadPagePageJoiner, set it per default on '-' and at zhwikisource on __PAGEJOIN__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) (owner: 10Zoranzoki21) [18:53:25] (03Merged) 10jenkins-bot: IS.php: Add wgProofreadPagePageJoiner, set it per default on '-' and at zhwikisource on __PAGEJOIN__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) (owner: 10Zoranzoki21) [18:54:43] PROBLEM - PHP opcache health on mw2233 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:55:03] (03PS6) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 [18:55:12] !log urbanecm@deploy1001 sync-file aborted: SWAT: 1630a10: Set wgProofreadPagePageJoiner to __PAGEJOIN__ for zhwikisource (duration: 00m 00s) [18:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:09] (03CR) 10jerkins-bot: [V: 04-1] wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 (owner: 10Herron) [18:56:16] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 1630a10: Set wgProofreadPagePageJoiner to __PAGEJOIN__ for zhwikisource (T205826) (duration: 00m 58s) [18:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:19] T205826: Set wgProofreadPagePageJoiner on zh.wikisource - 
https://phabricator.wikimedia.org/T205826 [18:56:33] !log Morning SWATconfig/backport window done [18:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:00] (03PS7) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 [19:04:20] (03PS1) 10CDanis: run-puppet-agent: add new flag --unless-version SUBSTR [puppet] - 10https://gerrit.wikimedia.org/r/603577 [19:04:45] RECOVERY - PHP opcache health on mw2233 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:06:05] (03PS8) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 [19:06:40] (03PS2) 10CDanis: run-puppet-agent: add new flag --unless-version SUBSTR [puppet] - 10https://gerrit.wikimedia.org/r/603577 [19:09:13] 10Operations, 10Wikidata, 10Wikidata-Query-Service: wdqs updater should be better isolated from blazegraph and common workload should be shared between servers - https://phabricator.wikimedia.org/T207837 (10Gehel) 05Open→03Declined This is being addressed as part of T244590 [19:09:17] 10Operations, 10Analytics, 10Event-Platform, 10Wikidata, and 7 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (10Gehel) [19:10:22] (03PS9) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 [19:15:13] PROBLEM - PHP opcache health on mw2235 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:15:17] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:15:52] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1001/23075/" [puppet] - 
10https://gerrit.wikimedia.org/r/602459 (owner: 10Herron) [19:17:03] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:17:45] (03PS10) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 [19:23:25] (03CR) 10Mforns: [C: 03+1] "This looks great! Thanks for the tip" [puppet] - 10https://gerrit.wikimedia.org/r/602771 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [19:24:47] (03PS11) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 [19:30:18] (03CR) 10Volans: [C: 03+1] "Seems reasonable, at the same time we should convert all those to Python :D" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/603577 (owner: 10CDanis) [19:32:41] PROBLEM - PHP opcache health on mw2231 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:35:57] (03PS2) 10Aaron Schulz: Enable "coalesceKeys" for global keys for WANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598855 [19:40:51] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:42:41] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:43:33] RECOVERY - PHP opcache health on mw2231 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:47:06] (03PS12) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/602459 [19:51:53] PROBLEM - PHP opcache health on mw2277 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:53:11] RECOVERY - PHP opcache health on mw2235 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:56:18] (03CR) 10Herron: [C: 04-2] "getting there... PCC looks okay-ish https://puppet-compiler.wmflabs.org/compiler1001/23079/ but this is not yet safe to merge" [puppet] - 10https://gerrit.wikimedia.org/r/602459 (owner: 10Herron) [20:00:04] halfak and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T2000). [20:01:06] (03PS13) 10Herron: elasticsearch: manage java dependencies with ::profile::java [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) [20:01:52] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: manage java dependencies with ::profile::java [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [20:02:56] (03CR) 10Herron: [C: 04-2] "I'll leave this a -2 and as WIP, but requesting initial feedback as this is fairly wide reaching with risk of breakage." 
[puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [20:04:39] (03PS14) 10Herron: elasticsearch: manage java dependencies with ::profile::java [puppet] - 10https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) [20:06:25] RECOVERY - PHP opcache health on mw2277 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:10:17] 10Operations, 10Analytics, 10Analytics-Kanban: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10Nuria) 05Open→03Resolved [20:16:15] 10Operations, 10SRE-Access-Requests: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (10lmata) [20:22:07] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [20:23:03] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10wiki_willy) [20:24:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10wiki_willy) [20:24:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10wiki_willy) [20:24:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - 
https://phabricator.wikimedia.org/T251627 (10wiki_willy) [20:24:51] PROBLEM - PHP opcache health on wtp2002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:25:51] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10wiki_willy) [20:26:20] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10wiki_willy) [20:27:03] !log Running initUserPreference.php -s growthexperiments-homepage-enable -t growthexperiments-help-panel-tog-help-panel on wikis that have GrowthExperiments installed (T240920) [20:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:08] T240920: Variant tests: turn on help panel for homepage people - https://phabricator.wikimedia.org/T240920 [20:28:09] PROBLEM - PHP opcache health on mw2274 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:32:03] (03CR) 10Mholloway: [C: 03+2] Chromium-render: Add initial helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) (owner: 10Mholloway) [20:32:34] (03Merged) 10jenkins-bot: Chromium-render: Add initial helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) (owner: 10Mholloway) [20:34:33] 10Operations, 10Wikimedia-SVG-rendering: Install (currently non-existing) Debian packages for PT (paratype) font on image scalars - https://phabricator.wikimedia.org/T97181 (10Aklapper) 05Stalled→03Open Reopening per last comment. 
[20:37:35] RECOVERY - PHP opcache health on wtp2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:39:03] RECOVERY - PHP opcache health on mw2274 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:52:02] !log applying the sql alter table on [[gerrit:594292|ipblocks]] on labswiki (T251188) [20:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:10] T251188: ipb_address_unique has an extra column in production but not in the code - https://phabricator.wikimedia.org/T251188 [20:53:13] (03PS2) 10Cwhite: hiera: install mtail 3.0.0~rc35 from component in ulsfo and codfw [puppet] - 10https://gerrit.wikimedia.org/r/601874 (https://phabricator.wikimedia.org/T251466) [21:00:04] Reedy and sbassett: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T2100). 
[21:02:18] (03CR) 10Cwhite: [C: 03+2] "PCC checks out https://puppet-compiler.wmflabs.org/compiler1003/23080/" [puppet] - 10https://gerrit.wikimedia.org/r/601874 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [21:03:17] (03CR) 10Mholloway: [C: 03+2] "Hi Alex, even after waiting on the next Puppet run after this was merged, it doesn't appear that Puppet has created the .hfenv files and p" [deployment-charts] - 10https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) (owner: 10Mholloway) [21:06:55] PROBLEM - PHP opcache health on mw2242 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:17:40] (03PS1) 10Cwhite: hiera: set mtail disable_fsnotify in codfw and ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/603614 (https://phabricator.wikimedia.org/T251466) [21:21:27] RECOVERY - PHP opcache health on mw2242 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:22:54] (03CR) 10Cwhite: [C: 03+2] hiera: set mtail disable_fsnotify in codfw and ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/603614 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [21:26:02] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10stjn) >>! In T242767#6202740, @Ottomata wrote: > I guess I'd like to hear from the EventStreams users on... 
[21:40:58] 10Operations, 10ops-codfw, 10netops: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) [21:43:49] PROBLEM - PHP opcache health on mw2269 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:53:51] PROBLEM - PHP opcache health on mw2326 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:54:41] RECOVERY - PHP opcache health on mw2269 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:55:52] (03PS1) 10Cwhite: hiera: add disable_fsnotify mtail flag to ncredir [puppet] - 10https://gerrit.wikimedia.org/r/603626 (https://phabricator.wikimedia.org/T251466) [21:57:22] (03CR) 10Cwhite: [C: 03+2] hiera: add disable_fsnotify mtail flag to ncredir [puppet] - 10https://gerrit.wikimedia.org/r/603626 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [21:58:45] (03CR) 10AntiCompositeNumber: [C: 03+1] "Looks good from here as well." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/603386 (https://phabricator.wikimedia.org/T254557) (owner: 10Gilles) [22:00:57] (03PS1) 10Cwhite: hiera: move ncredir config to profile [puppet] - 10https://gerrit.wikimedia.org/r/603628 (https://phabricator.wikimedia.org/T251466) [22:01:46] (03CR) 10Cwhite: [C: 03+2] hiera: move ncredir config to profile [puppet] - 10https://gerrit.wikimedia.org/r/603628 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [22:03:36] (03PS1) 10Cmjohnson: Adding relforge100[34] to site.pp role insetup [puppet] - 10https://gerrit.wikimedia.org/r/603634 (https://phabricator.wikimedia.org/T241791) [22:04:06] (03CR) 10Cmjohnson: [C: 03+2] Adding relforge100[34] to site.pp role insetup [puppet] - 10https://gerrit.wikimedia.org/r/603634 (https://phabricator.wikimedia.org/T241791) (owner: 10Cmjohnson) [22:12:01] RECOVERY - PHP opcache health on mw2326 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:13:33] (03PS1) 10Cmjohnson: Adding thanos-fe100[1-3] to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/603637 (https://phabricator.wikimedia.org/T251620) [22:20:08] 10Operations, 10ops-codfw, 10decommission, 10serviceops, 10Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Papaul) switch ports removed for mw2154 through mw2186 [22:21:35] (03PS2) 10Cmjohnson: Adding thanos-fe100[1-3] to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/603637 (https://phabricator.wikimedia.org/T251620) [22:22:53] (03CR) 10Cmjohnson: [C: 03+2] Adding thanos-fe100[1-3] to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/603637 (https://phabricator.wikimedia.org/T251620) (owner: 10Cmjohnson) [22:23:46] 10Operations, 10ops-codfw, 10DC-Ops: Decomission oresrdb2002.codfw.wmnet - https://phabricator.wikimedia.org/T254240 (10wiki_willy) 
a:03Papaul [22:26:05] 10Operations, 10Core Platform Team, 10Traffic: Move wikitech purges to kafka - https://phabricator.wikimedia.org/T254828 (10Pchelolo) [22:26:50] (03Abandoned) 10Urbanecm: Remove unused logos from /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521282 (https://phabricator.wikimedia.org/T227419) (owner: 10Urbanecm) [22:27:17] (03Abandoned) 10Urbanecm: [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) (owner: 10Urbanecm) [22:33:55] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 64 probes of 577 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:34:55] (03PS1) 10Cmjohnson: Adding thanos-be100[1-4] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/603648 (https://phabricator.wikimedia.org/T251618) [22:36:42] (03PS2) 10Cmjohnson: Adding thanos-be100[1-4] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/603648 (https://phabricator.wikimedia.org/T251618) [22:37:37] (03PS1) 10Ppchelko: [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781) [22:38:26] (03CR) 10jerkins-bot: [V: 04-1] [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [22:41:06] (03PS2) 10Ppchelko: [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781) [22:45:23] (03CR) 10Cmjohnson: [C: 03+2] Adding thanos-be100[1-4] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/603648 
(https://phabricator.wikimedia.org/T251618) (owner: 10Cmjohnson) [22:45:46] (03PS3) 10Ppchelko: [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781) [22:45:49] PROBLEM - PHP opcache health on wtp2003 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:46:38] (03CR) 10jerkins-bot: [V: 04-1] [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [22:47:22] (03PS1) 10BryanDavis: Pywikibot container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/603652 (https://phabricator.wikimedia.org/T249787) [22:48:47] (03PS4) 10Ppchelko: [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781) [22:49:09] !log ryankemper@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [22:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:24] (03CR) 10BryanDavis: "I'm not sure this is best way to approach the problem, but I thought I would at least get my work out of a local directory and into gerrit" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/603652 (https://phabricator.wikimedia.org/T249787) (owner: 10BryanDavis) [22:52:48] (03PS1) 10Ppchelko: Enable kafka purges everywhere. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/603654 (https://phabricator.wikimedia.org/T250781) [22:52:50] (03PS1) 10Ppchelko: Disbalse HTCP purges where kafka purges are enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603655 (https://phabricator.wikimedia.org/T250781) [22:53:09] !log update mtail to 3.0.0~rc35 on mw and wtp hosts codfw [22:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:13] (03PS2) 10Ppchelko: Disable HTCP purges where kafka purges are enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603655 (https://phabricator.wikimedia.org/T250781) [22:53:18] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99) [22:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:38] (03PS1) 10Cmjohnson: Add thanos-be100[1234] to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/603656 (https://phabricator.wikimedia.org/T251618) [22:54:20] (03CR) 10Cmjohnson: [C: 03+2] Add thanos-be100[1234] to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/603656 (https://phabricator.wikimedia.org/T251618) (owner: 10Cmjohnson) [22:58:16] !log ryankemper@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [22:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T2300). 
[23:02:22] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99) [23:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:59] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 577 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:09:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-fe1003.eqiad.wmne... [23:11:39] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1001.eqiad.wmnet ` The log can be found in `/var... [23:13:54] (03PS3) 10Krinkle: logging: Combine the three custom Monolog processors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601813 [23:14:05] (03PS2) 10Krinkle: logging: Omit 'unique_id' from WebProcessor mixin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601814 (https://phabricator.wikimedia.org/T253677) [23:14:06] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1002.eqiad.wmnet ` The log can be found in `/var... 
[23:14:15] (03CR) 10Krinkle: [C: 03+2] logging: Combine the three custom Monolog processors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601813 (owner: 10Krinkle) [23:15:28] (03Merged) 10jenkins-bot: logging: Combine the three custom Monolog processors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601813 (owner: 10Krinkle) [23:15:29] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1003.eqiad.wmnet ` The log can be found in `/var... [23:20:15] RECOVERY - PHP opcache health on wtp2003 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:21:21] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1001.eqiad.wmnet ` The log can be found in `/var... [23:21:36] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1002.eqiad.wmnet ` The log can be found in `/var... [23:21:44] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1003.eqiad.wmnet ` The log can be found in `/var... 
[23:22:57] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1004.eqiad.wmnet ` The log can be found in `/var... [23:23:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [23:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:33] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:01] * Krinkle testing on mwdebug1002 [23:29:49] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-fe1003.eqiad.wmnet'] ` and were **ALL** successful. [23:32:24] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10Cmjohnson) [23:32:43] !log removing one file for legal compliance [23:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:14] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10Cmjohnson) thanos-fe1003 is the only one installed at the moment. thanos-fe1001 mgmt is not working, - need to check cable thanos-fe1002 does not appea... 
[23:33:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:34:30] PROBLEM - PHP opcache health on mw2193 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:35:45] !log krinkle@deploy1001 Synchronized wmf-config/logging.php: I8c22a1a8fc402 (duration: 00m 58s) [23:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:36:21] (03CR) 10Legoktm: Add html web image (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/601839 (https://phabricator.wikimedia.org/T241817) (owner: 10Legoktm) [23:37:33] (03PS2) 10Legoktm: Add html web image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/601839 (https://phabricator.wikimedia.org/T241817) [23:38:41] (03CR) 10Krinkle: [C: 03+2] logging: Omit 'unique_id' from WebProcessor mixin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601814 (https://phabricator.wikimedia.org/T253677) (owner: 10Krinkle) [23:38:45] (03PS1) 10Legoktm: Drop fam from everywhere [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/603667 [23:39:31] (03Merged) 10jenkins-bot: logging: Omit 'unique_id' from WebProcessor mixin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601814 (https://phabricator.wikimedia.org/T253677) (owner: 10Krinkle) [23:39:48] PROBLEM - 
MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:40:47] (03PS1) 10BryanDavis: Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) [23:41:24] (03CR) 10Legoktm: "I didn't see this before I pushed Change-Id: Ibc99d13d63340cde3c5fdcd3c3c5a7a9255b3d76, but that already has the drop from build.py part i" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/595966 (owner: 10BryanDavis) [23:41:30] (03CR) 10jerkins-bot: [V: 04-1] Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: 10BryanDavis) [23:42:29] (03Abandoned) 10BryanDavis: Remove unused static-web container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/595966 (owner: 10BryanDavis) [23:42:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:43:00] (03CR) 10BryanDavis: [C: 03+1] Drop unused static-web-sssd image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/601838 (owner: 10Legoktm) [23:45:08] RECOVERY - PHP opcache health on mw2193 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:46:17] (03PS1) 10Cwhite: hiera: add disable_fsnotify flag for mtail in codfw [puppet] - 10https://gerrit.wikimedia.org/r/603673 (https://phabricator.wikimedia.org/T251466) [23:47:55] (03CR) 10Krinkle: [C: 03+2] "Confirmed via mwdebug1002 that two logically identical message documents look the same before/after, except without the confusiong unique_" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601814 (https://phabricator.wikimedia.org/T253677) (owner: 10Krinkle) [23:48:17] (03CR) 10Cwhite: [C: 03+2] hiera: add disable_fsnotify flag for mtail in codfw [puppet] - 10https://gerrit.wikimedia.org/r/603673 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [23:49:12] !log krinkle@deploy1001 Synchronized wmf-config/logging.php: If991929c84ff69 (duration: 00m 57s) [23:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:41] (03PS2) 10BryanDavis: Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) [23:52:08] (03CR) 10BryanDavis: [C: 03+2] Drop unused static-web-sssd image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/601838 (owner: 10Legoktm) [23:52:39] (03Merged) 10jenkins-bot: Drop unused static-web-sssd image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/601838 (owner: 10Legoktm) [23:58:47] PROBLEM - PHP opcache health on mw2197 is CRITICAL: 
CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health