[00:00:12] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:20] <icinga-wm>	 RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:26] <icinga-wm>	 RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:36] <icinga-wm>	 RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:44] <icinga-wm>	 RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:09] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) Today even with {F31857474} all I got was {F31857476} Solution: Commons should simply use the same uploader as phabricator.
[00:14:33] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) https://commons.wikimedia.org/w/index.php?title=Special:Upload&uploadformstyle=basic just gets up to 68% and says    > Error > Our s...
[00:16:59] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) Next attempt got to 80%, then > Request from - via cp5008.eqsin.wmnet, ATS/8.0.7 > Error: 502, Next Hop Connection Failed at 2020-06...
[00:18:49] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) OK, now got to 92%, and   > Request from - via cp5008.eqsin.wmnet, ATS/8.0.7 > Error: 502, Next Hop Connection Failed at 2020-06-08...
[00:21:24] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) Nope, 92% is as far as one can get, before > Request from - via cp5008.eqsin.wmnet, ATS/8.0.7 > Error: 502, Next Hop Connection Fail...
[00:35:08] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) > Upload warning > Copy uploads are not available from this domain.  Great.
[00:41:04] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Platonides) It doesn't make any sense that you can upload to phabricator, but not to commons. I would suspect some crazy with some intermedia...
[00:41:38] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) OK finally succeeded via https://tools.wmflabs.org/url2commons/index.html copying from phabricator to https://commons.wikimedia.org/...
[00:49:02] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni)  https://commons.wikimedia.beta.wmflabs.org/ : OK, created account... Uploading .... and of course at the very end... "None of the u...
[01:02:30] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) I bet the form isn't even being sent. {F31857497} Is the biggest thing to make a curl of. And my network monitor doesn't show a lot...
[01:07:23] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) > pppd default-asyncmap defaultroute lcp-echo-failure 7 lcp-echo-interval > 50 mtu 1492 noaccomp noauth noipdefault noproxyarp persi...
[01:13:44] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) Anyway, today I was given IP address 36.234.68.20, so when you check the server errors logs above, you will see me.
[01:20:57] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) OK, uploading the file to Twitter, lots of healthy yellow seen in the icewm network monitor {F31857502}  Unlike when uploading to Co...
[04:15:34] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6231 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:17:00] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 31.25 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[04:17:32] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[04:17:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:17:52] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[04:17:52] <icinga-wm>	 PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[04:17:58] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_mobileapps_cluster_eqiad,swagger_check_restbase_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:17:58] <icinga-wm>	 PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:18:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:18:04] <icinga-wm>	 PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:18:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1079.eqiad.wmnet, cp1083.eqiad.wmnet, cp1089.eqiad.wmnet, cp1075.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down b
[04:18:12] <icinga-wm>	 6_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:18:18] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:18:18] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:18:18] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[04:18:26] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[04:18:30] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:18:32] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[04:18:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:18:36] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[04:18:46] <icinga-wm>	 PROBLEM - wiki content on commons #page on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/1118/
[04:18:50] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[04:19:02] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1013 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1079.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1081.eqiad.wmnet, cp1079.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1089.eqiad.wmnet, cp1075.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: testlb6_4
[04:19:02] <icinga-wm>	 1.eqiad.wmnet, cp1079.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:19:02] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[04:19:02] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqiad.wikimedia.org, port=443): Read timed out. (read timeout=15),): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase
[04:19:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:19:04] <icinga-wm>	 PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid
[04:19:12] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:19:16] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[04:19:18] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 170 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:19:20] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 32.29 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:19:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:19:34] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[04:19:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:19:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:19:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:19:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:19:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:19:48] <icinga-wm>	 PROBLEM - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:19:48] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:20:02] <icinga-wm>	 RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1083 is OK: HTTP OK: HTTP/1.1 200 Ok - 32294 bytes in 0.446 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:20:06] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:20:06] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:20:26] <icinga-wm>	 PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
[04:20:30] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:20:30] <_joe_>	 oh fuck
[04:20:32] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:20:38] <_joe_>	 I just opened my eyes
[04:20:43] <vgutierrez>	 yeah
[04:20:58] <_joe_>	 this is a problem with memcached AFAICT
[04:21:10] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:21:10] <_joe_>	 give me 1 minute to come to my senses and sit down and I'll look
[04:21:20] <vgutierrez>	 ack
[04:21:24] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:21:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:21:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:21:30] <icinga-wm>	 RECOVERY - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 549 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:21:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:21:34] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:21:34] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:21:34] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:21:34] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:21:36] <icinga-wm>	 RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16207 bytes in 0.510 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:21:36] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:21:45] <XioNoX>	 I'm waking up too
[04:21:46] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:21:47] <marostegui>	 hey
[04:21:55] <apergos>	 same
[04:22:00] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /
[04:22:00] <icinga-wm>	 mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before
[04:22:00] <icinga-wm>	 eceived: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[04:22:04] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:22:08] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:22:08] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:22:12] <icinga-wm>	 RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 37204 bytes in 3.608 second response time https://wikitech.wikimedia.org/wiki/Phabricator
[04:22:16] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:22:17] * shdubsh here
[04:22:20] <icinga-wm>	 RECOVERY - wiki content on commons #page on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 172691 bytes in 0.042 second response time https://phabricator.wikimedia.org/project/view/1118/
[04:22:21] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:22:22] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:22:26] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:22:38] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:22:38] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:22:40] <chaomodus>	 here
[04:22:43] <chaomodus>	 what's up
[04:22:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:22:50] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/71/
[04:22:52] <_joe_>	 ok, I'm looking at appservers as soon as I connect
[04:22:52] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1403 is OK: HTTP OK: HTTP/1.1 200 OK - 84213 bytes in 0.786 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:23:02] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.085e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:23:03] <_joe_>	 can someone be IC?
[04:23:08] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:08] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:18] <_joe_>	 I am looking at mcrouter logs now
[04:23:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:23:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:23:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:23:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:23:28] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:30] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:23:38] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:38] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:38] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:42] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[04:23:44] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:48] <XioNoX>	 _joe_: I can
[04:23:48] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:48] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:52] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:52] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:54] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[04:23:58] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:58] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:23:58] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:24:04] <_joe_>	 mc1029.eqiad.wmnet is the problem
[04:24:20] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:24:24] <icinga-wm>	 PROBLEM - ATS TLS has reduced HTTP availability #page on icinga1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[04:24:38] <icinga-wm>	 RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid
[04:24:40] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:24:45] <_joe_>	 enwiki:pcache:idhash:41768916-0!canonical 
[04:24:56] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:24:56] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:25:12] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:25:12] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:25:13] <_joe_>	 no idea what that is but I'm thinking of tearing down memcache on this host
[04:25:24] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:25:24] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:25:32] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:25:34] <_joe_>	 304k sized key
[04:25:40] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:25:40] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:25:42] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:25:52] <shdubsh>	 gutter pool shows a lot of activity about the time of the first page
[04:25:56] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:26:00] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:26:00] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:26:27] <_joe_>	 shdubsh: it seems we're somewhat below the point where things get completely moved to the gutter
[04:26:32] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:26:42] <_joe_>	 so I'm going to firewall off mc1029 now
[04:26:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:27:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:27:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:27:15] <icinga-wm>	 PROBLEM - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:27:18] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:27:20] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:27:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:27:20] <icinga-wm>	 PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:27:28] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:27:28] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:27:34] <_joe_>	 !log firewalling off memcached on mc1029
[04:27:46] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:28:13] <DannyS712>	 Reports in #wikipedia-en of enwiki being slow - assuming it's the Varnish issues above?
[04:28:35] <stashbot>	 _joe_: Failed to log message to wiki. Somebody should check the error logs.
[04:29:00] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:29:00] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:29:02] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:29:02] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:29:08] <icinga-wm>	 RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16226 bytes in 5.573 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:29:10] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:29:10] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:29:18] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:29:26] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:29:28] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:29:32] <_joe_>	 let's see if this improves things
[04:29:44] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[04:29:46] <icinga-wm>	 PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
[04:29:48] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:29:50] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:29:54] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:29:55] <icinga-wm>	 PROBLEM - wiki content on commons #page on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/1118/
[04:29:58] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 3.887 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:30:02] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:30:06] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:30:16] <icinga-wm>	 PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid
[04:30:20] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqiad.wikimedia.org, port=443): Read timed out. (read timeout=15),): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase
[04:30:44] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 84129 bytes in 1.261 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:30:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:30:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:31:03] <icinga-wm>	 PROBLEM - LibreNMS has a critical alert #page on icinga1001 is CRITICAL: Primary inbound port utilisation over 80% #page (asw2-b-eqiad.mgmt.eqiad.wmnet) // Primary outbound port utilisation over 80% #page (cr1-eqiad.wikimedia.org) https://wikitech.wikimedia.org/wiki/Network_monitoring%23LibreNMS_alerts
[04:31:10] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:31:10] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:31:16] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[04:31:32] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:31:34] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:31:54] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:32:18] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 216 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:32:26] <icinga-wm>	 PROBLEM - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is CRITICAL: CRITICAL - exception while fetching the URL. 502 Server Error: Next Hop Connection Failed for url: https://en.planet.wikimedia.org/ https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[04:32:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:33:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:33:00] <icinga-wm>	 RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16213 bytes in 6.584 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:33:20] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:33:40] <icinga-wm>	 PROBLEM - Memcached on mc1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached
[04:34:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:34:48] <icinga-wm>	 PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:34:50] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:35:08] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:35:12] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:35:12] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:35:22] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:35:22] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:35:24] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/media-list/{title
[04:35:24] <icinga-wm>	  from storage) is CRITICAL: Test Get media-list from storage returned the unexpected status 502 (expecting: 200): /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was recei
[04:35:24] <icinga-wm>	 /page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the 
[04:35:24] <icinga-wm>	  502 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[04:35:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:36:02] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[04:36:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:36:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:36:26] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 7.028 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:28] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 7.703 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:36:34] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 541 bytes in 2.342 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:34] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:34] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:36] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:36:36] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:36] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.303 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:36:38] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:40] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:46] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:46] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:46] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:46] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:46] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:47] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:48] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:36:48] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:48] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:58] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:36:58] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:00] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:02] <icinga-wm>	 RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[04:37:02] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:02] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:02] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[04:37:04] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:04] <icinga-wm>	 RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 37204 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Phabricator
[04:37:06] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[04:37:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:37:12] <icinga-wm>	 RECOVERY - wiki content on commons #page on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 172692 bytes in 0.013 second response time https://phabricator.wikimedia.org/project/view/1118/
[04:37:20] <icinga-wm>	 RECOVERY - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is OK: OK - Website content is current (855 = 86400) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[04:37:20] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:20] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:20] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:22] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:22] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:22] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:26] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:26] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[04:37:34] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:34] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:37:36] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:37:38] <icinga-wm>	 RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid
[04:37:38] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1077 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:40] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:48] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:37:52] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:52] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:37:54] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:38:08] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[04:38:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:38:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:38:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:38:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:38:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:38:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:38:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:38:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:38:18] <icinga-wm>	 RECOVERY - LVS text eqiad port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:38:22] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[04:38:22] <icinga-wm>	 RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[04:38:22] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:38:22] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:38:23] <icinga-wm>	 RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16226 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:38:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:38:28] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[04:38:42] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:38:44] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:38:44] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:38:46] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:38:46] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[04:38:50] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:38:52] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:38:52] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:38:54] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:39:08] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:39:08] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:39:10] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:39:12] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:39:12] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 544 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:39:12] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1083 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:39:18] <icinga-wm>	 RECOVERY - ATS TLS has reduced HTTP availability #page on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[04:39:28] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[04:39:36] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:39:36] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 542 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:39:54] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[04:39:56] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[04:40:14] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1759 bytes in 0.068 second response time https://phabricator.wikimedia.org/project/view/71/
[04:41:38] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 94.07 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:43:00] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.0125 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[04:43:34] <wikibugs>	 (03PS2) 10Catrope: Enable GrowthExperiments guidance everywhere behind feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601409 (https://phabricator.wikimedia.org/T253794) (owner: 10Gergő Tisza)
[04:44:02] <icinga-wm>	 RECOVERY - LibreNMS has a critical alert #page on icinga1001 is OK: OK: zero critical LibreNMS alerts https://wikitech.wikimedia.org/wiki/Network_monitoring%23LibreNMS_alerts
[04:45:36] <_joe_>	 !log de-firewalling mc1029
[04:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:26] <icinga-wm>	 RECOVERY - Memcached on mc1029 is OK: TCP OK - 0.001 second response time on 10.64.32.209 port 11211 https://wikitech.wikimedia.org/wiki/Memcached
[05:02:06] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2020-06-08-045500-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/603141 (https://phabricator.wikimedia.org/T246319)
[05:08:44] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 62 probes of 579 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:14:34] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 579 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:20:26] <wikibugs>	 (03PS1) 10Vgutierrez: ATS: Increase ats-backend max connections and max active connections [puppet] - 10https://gerrit.wikimedia.org/r/603158
[05:22:49] <marostegui>	 !log Upgrade db1077 to 10.4.13 to test events memory leak
[05:22:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:47] <wikibugs>	 (03PS1) 10DannyS712: Remove TranslationNotifications user settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780)
[06:00:16] <wikibugs>	 (03PS2) 10DannyS712: Remove TranslationNotifications user settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780)
[06:05:50] <wikibugs>	 (03PS1) 10Abijeet Patro: TranslationNotifications: Remove username / password for sending messages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603169 (https://phabricator.wikimedia.org/T144780)
[06:06:16] <wikibugs>	 (03CR) 10DannyS712: "Dupe of https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/603167/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603169 (https://phabricator.wikimedia.org/T144780) (owner: 10Abijeet Patro)
[06:07:22] <wikibugs>	 (03Abandoned) 10Abijeet Patro: TranslationNotifications: Remove username / password for sending messages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603169 (https://phabricator.wikimedia.org/T144780) (owner: 10Abijeet Patro)
[06:24:03] <wikibugs>	 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10elukey) I am all for testing new versions of memcached to get experience, so on this front you'll always have my +1 :) Upstream is also very available to help and give feedba...
[06:29:04] <icinga-wm>	 PROBLEM - Check systemd state on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:32] <icinga-wm>	 PROBLEM - MD RAID on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:30:18] <icinga-wm>	 PROBLEM - ores uWSGI web app on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:30:24] <icinga-wm>	 PROBLEM - DPKG on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[06:30:24] <icinga-wm>	 PROBLEM - Check size of conntrack table on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:30:26] <elukey>	 this is the lovely celery oom, running puppet --^
[06:30:52] <icinga-wm>	 RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:14] <icinga-wm>	 RECOVERY - Check size of conntrack table on ores1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:35:17] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10elukey) @Halfak have we reported the issue to upstream asking for advice (https://github.com/unbit/uwsgi)?...
[06:40:20] <icinga-wm>	 RECOVERY - MD RAID on ores1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:45:07] <wikibugs>	 10Operations, 10SRE-tools, 10Continuous-Integration-Config, 10Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494 (10hashar) 05duplicate→03Open Reopening since this is about adding shellcheck on any repository while T254480 is specific to puppet.git and covers...
[06:52:45] <wikibugs>	 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10hashar) The CI container is built using Buster which co...
[07:01:12] <icinga-wm>	 RECOVERY - DPKG on ores1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:02:20] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] textfile exporter for php-fpm worker pool status [puppet] - 10https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis)
[07:05:35] <marostegui>	 !log Stop MySQL on labsdb1012 to clone labsdb1011 T249188
[07:05:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:39] <stashbot>	 T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188
[07:10:20] <wikibugs>	 (03PS2) 10Dzahn: admin: shell account for Yi-Ju Lu, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/602756 (https://phabricator.wikimedia.org/T254130)
[07:10:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] admin: shell account for Yi-Ju Lu, add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/602756 (https://phabricator.wikimedia.org/T254130) (owner: 10Dzahn)
[07:20:37] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) 05Open→03Resolved Hello YiJuLu, your shell account has been created now.  Here is some more information about the SSH config you will need to...
[07:21:13] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) cc: @diego done!
[07:23:52] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10Dzahn) @YiJuLu Keep in mind your shell user is "lulu".
[07:26:13] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (10YiJuLu) @Dzahn  Thanks a lot :)
[07:27:37] <wikibugs>	 (03CR) 10Dzahn: admin: create shell user for Daniel Cipoletti (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) (owner: 10Dzahn)
[07:27:40] <wikibugs>	 (03PS1) 10Marostegui: install_server: Allow reimage db1141 [puppet] - 10https://gerrit.wikimedia.org/r/603338 (https://phabricator.wikimedia.org/T252512)
[07:27:53] <wikibugs>	 (03PS2) 10Dzahn: admin: create shell user for Daniel Cipoletti [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086)
[07:28:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage db1141 [puppet] - 10https://gerrit.wikimedia.org/r/603338 (https://phabricator.wikimedia.org/T252512) (owner: 10Marostegui)
[07:30:01] <wikibugs>	 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10hashar) >>! In T254491#6196244, @jbond wrote: > i have made a first pass at the [[ https://wikitech.wikimedia.org/wiki/Incident_documentatio...
[07:31:29] <wikibugs>	 (03CR) 10Ayounsi: BGP: add transit links (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/602119 (https://phabricator.wikimedia.org/T250136) (owner: 10Ayounsi)
[07:37:07] <wikibugs>	 (03CR) 10Muehlenhoff: admin: create shell user for Daniel Cipoletti (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) (owner: 10Dzahn)
[07:37:12] <moritzm>	 !log installing nodejs security updates
[07:37:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:17] <XioNoX>	 !log cr3-ulsfo protocols bgp group Transit4 family inet any -> unicast - T250136
[07:39:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:21] <stashbot>	 T250136: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136
[07:41:33] <moritzm>	 !og restarting turnilo for nodejs security update
[07:42:38] <XioNoX>	 !log cr4-ulsfo protocols bgp group Transit4 family inet any -> unicast - T250136
[07:42:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:38] <wikibugs>	 (03PS3) 10Dzahn: admin: create shell user for Daniel Cipoletti [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086)
[07:45:57] <wikibugs>	 (03CR) 10Dzahn: admin: create shell user for Daniel Cipoletti (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) (owner: 10Dzahn)
[07:46:43] <XioNoX>	 !log push T250136 to esams/knams - T250136
[07:46:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:47] <stashbot>	 T250136: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136
[07:46:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] admin: create shell user for Daniel Cipoletti [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) (owner: 10Dzahn)
[07:47:51] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10Dzahn) Thanks @dr0ptp4kt .  Created Kerberos user [https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_principal_for_a_real_user]  @...
[07:48:15] <wikibugs>	 (03PS4) 10Dzahn: admin: create shell user for Daniel Cipoletti [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086)
[07:48:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] admin: create shell user for Daniel Cipoletti [puppet] - 10https://gerrit.wikimedia.org/r/602761 (https://phabricator.wikimedia.org/T253086) (owner: 10Dzahn)
[07:50:10] <wikibugs>	 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10Dzahn) a:05Ferdi2005→03None
[07:50:54] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris)
[07:54:30] <icinga-wm>	 PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:02] <wikibugs>	 10Operations, 10Traffic: ats-backend throttles connections under heavy load - https://phabricator.wikimedia.org/T254714 (10Vgutierrez)
[07:57:08] <mutante>	 !log ran puppet on all stat* hosts for an access request (dcipoletti was added) - stat1006 systemd state broke right after, jupyter-dedcode-singleuser.service  failed
[07:57:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:03] <wikibugs>	 (03PS2) 10Vgutierrez: ATS: Increase ats-backend max connections and max active connections [puppet] - 10https://gerrit.wikimedia.org/r/603158 (https://phabricator.wikimedia.org/T254714)
[07:58:08] <mutante>	 ! stat1006  stat1006 bash[40607]: /bin/bash: line 0: exec: jupyterhub-singleuser: not found
[07:58:17] <mutante>	 !log stat1006 bash[40607]: /bin/bash: line 0: exec: jupyterhub-singleuser: not found
[07:58:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:33] <XioNoX>	 !log push T250136 to eqord/eqdfw - T250136
[07:58:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:36] <stashbot>	 T250136: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136
[07:59:07] <wikibugs>	 (03CR) 10Ema: [C: 03+1] ATS: Increase ats-backend max connections and max active connections [puppet] - 10https://gerrit.wikimedia.org/r/603158 (https://phabricator.wikimedia.org/T254714) (owner: 10Vgutierrez)
[07:59:13] <wikibugs>	 10Operations, 10Phabricator, 10Security-Team, 10Traffic: Accessing Phabricator from Tor - https://phabricator.wikimedia.org/T254568 (10Fae) Playing with the Tor browser this morning, a work-around could be for users to keep trying new Tor circuits until they stop getting the Error 500 message. This appe...
[08:02:15] <XioNoX>	 !log push T250136 to codfw - T250136
[08:02:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable Thanos upload for services [puppet] - 10https://gerrit.wikimedia.org/r/602716 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi)
[08:03:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: bail on check_command containing newlines [puppet] - 10https://gerrit.wikimedia.org/r/602669 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi)
[08:04:02] <wikibugs>	 (03PS2) 10Filippo Giunchedi: monitoring: bail on check_command containing newlines [puppet] - 10https://gerrit.wikimedia.org/r/602669 (https://phabricator.wikimedia.org/T252186)
[08:06:16] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] ATS: Increase ats-backend max connections and max active connections [puppet] - 10https://gerrit.wikimedia.org/r/603158 (https://phabricator.wikimedia.org/T254714) (owner: 10Vgutierrez)
[08:06:48] <godog>	 vgutierrez: merging your change too
[08:06:49] <vgutierrez>	 godog: feel free to merge mine if you're seeing a multiple warning :)
[08:06:53] <vgutierrez>	 yeah that :D
[08:06:57] <vgutierrez>	 thanks!
[08:07:04] <godog>	 good timing :D
[08:07:06] <icinga-wm>	 RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:07:07] <mutante>	 !log stat1006 moved broken jupyter-dedcode-singleuser.service out of /run/systemd/transient.   systemctl reset-failed
[08:07:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:43] <moritzm>	 !log upgrading mw1349-mw1383 to PHP 7.2.31
[08:07:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:52] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10Dzahn) 05Open→03Resolved @dcipoletti Your shell account has been created now.  Here is some more information about the SSH config you will need to j...
[08:09:37] <XioNoX>	 !log push T250136 to eqiad - T250136
[08:09:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:40] <stashbot>	 T250136: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136
[08:17:36] <XioNoX>	 !log push T250136 to eqsin - T250136
[08:17:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:40] <stashbot>	 T250136: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136
[08:20:09] <wikibugs>	 (03PS2) 10Filippo Giunchedi: thanos: add alerts for Thanos components [puppet] - 10https://gerrit.wikimedia.org/r/602633 (https://phabricator.wikimedia.org/T252186)
[08:20:11] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: enable Thanos upload for services [puppet] - 10https://gerrit.wikimedia.org/r/602716 (https://phabricator.wikimedia.org/T252186)
[08:20:13] <wikibugs>	 (03PS5) 10Filippo Giunchedi: prometheus: merge ops instance role into profile [puppet] - 10https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186)
[08:20:15] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: enable Thanos upload for k8s [puppet] - 10https://gerrit.wikimedia.org/r/602715 (https://phabricator.wikimedia.org/T252186)
[08:20:17] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: enable Thanos upload for ops in esams [puppet] - 10https://gerrit.wikimedia.org/r/602717 (https://phabricator.wikimedia.org/T252186)
[08:20:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: thanos: add alerts for Thanos components (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602633 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi)
[08:21:22] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] BGP: add transit links [homer/public] - 10https://gerrit.wikimedia.org/r/602119 (https://phabricator.wikimedia.org/T250136) (owner: 10Ayounsi)
[08:21:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add alerts for Thanos components [puppet] - 10https://gerrit.wikimedia.org/r/602633 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi)
[08:22:52] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, and 2 others: ATS or Varnish incorrectly strips Content-Disposition header for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles) a:03ema
[08:30:20] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 57 probes of 661 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:32:37] <wikibugs>	 (03PS1) 10Ayounsi: Remove unused or outdated esams AS-specific policy-statements [homer/public] - 10https://gerrit.wikimedia.org/r/603363 (https://phabricator.wikimedia.org/T250136)
[08:32:47] <wikibugs>	 (03PS1) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480)
[08:32:58] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Browser-Support-Firefox: Swift doesn't save or regenerate Content-Disposition: inline for thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles) p:05Triage→03Medium a:05ema→03Gilles
[08:33:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable Thanos upload for services [puppet] - 10https://gerrit.wikimedia.org/r/602716 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi)
[08:33:23] <wikibugs>	 (03PS1) 10RhinosF1: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603365
[08:33:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn)
[08:35:30] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 8 probes of 661 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:36:38] <icinga-wm>	 PROBLEM - Thanos sidecar cannot connect to Prometheus on icinga1001 is CRITICAL: cluster={bastion,prometheus} instance={bast3004:19900,bast4002:19900,bast5001:19900,prometheus1003:19900,prometheus1003:19903,prometheus1003:19905,prometheus1003:19906,prometheus1003:19907,prometheus1004:19900,prometheus1004:19903,prometheus1004:19905,prometheus1004:19906,prometheus1004:19907,prometheus2003:19900,prometheus2003:19903,prometheus2003:1
[08:36:38] <icinga-wm>	 03:19906,prometheus2004:19900,prometheus2004:19903,prometheus2004:19905,prometheus2004:19906} job=thanos-sidecar prometheus=ops site={codfw,eqiad,eqsin,esams,ulsfo} https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar
[08:37:14] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Browser-Support-Firefox: Swift doesn't save Content-Disposition: inline for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles)
[08:38:11] <wikibugs>	 (03PS2) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480)
[08:38:38] <godog>	 the thanos alert is me
[08:38:52] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Browser-Support-Firefox: Swift doesn't save Content-Disposition: inline for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles) By poking at objects stored in Swift I've been able to establish that jpg thumbnails have t...
[08:40:04] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Browser-Support-Firefox: Thumbor doesn't save Content-Disposition: inline headers to Swift for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles)
[08:40:06] <wikibugs>	 (03PS2) 10RhinosF1: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603365
[08:40:14] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] "The only provider whose communities we actually do something with (Init7) doesn't use its currently valid communities https://www.as13030.ne" [homer/public] - 10https://gerrit.wikimedia.org/r/603363 (https://phabricator.wikimedia.org/T250136) (owner: 10Ayounsi)
[08:40:19] <wikibugs>	 (03PS3) 10RhinosF1: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603365 (https://phabricator.wikimedia.org/T254012)
[08:40:39] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unused or outdated esams AS-specific policy-statements [homer/public] - 10https://gerrit.wikimedia.org/r/603363 (https://phabricator.wikimedia.org/T250136) (owner: 10Ayounsi)
[08:40:45] <wikibugs>	 (03PS1) 10Elukey: Switch backend for piwik.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/603366 (https://phabricator.wikimedia.org/T252740)
[08:42:58] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Browser-Support-Firefox: Thumbor doesn't save Content-Disposition: inline headers to Swift for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles) The same might actually be true for all thumbnails, but might be masked...
[08:44:02] <icinga-wm>	 PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.421e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[08:44:26] <icinga-wm>	 PROBLEM - Prometheus prometheus1003/services restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9903 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services
[08:45:02] <godog>	 known ^
[08:47:57] <wikibugs>	 (03PS1) 10Elukey: matomo: remove unnecessary plugin [puppet] - 10https://gerrit.wikimedia.org/r/603369 (https://phabricator.wikimedia.org/T252740)
[08:48:38] <icinga-wm>	 PROBLEM - Prometheus prometheus2004/services restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9903 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services
[08:49:46] <icinga-wm>	 PROBLEM - Prometheus prometheus2003/services restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9903 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services
[08:52:30] <wikibugs>	 (03CR) 10RhinosF1: [C: 04-1] "hold for on-task discussion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603365 (https://phabricator.wikimedia.org/T254012) (owner: 10RhinosF1)
[08:52:36] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] matomo: remove unnecessary plugin [puppet] - 10https://gerrit.wikimedia.org/r/603369 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey)
[08:52:59] <wikibugs>	 (03CR) 10Dzahn: "currently I don't see why the script content seems to be empty in compiler: https://puppet-compiler.wmflabs.org/compiler1001/23051/phab100" [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: 10Dzahn)
[08:53:54] <wikibugs>	 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Aklapper) @Jidanni: Could you please not create 12 comments in 70 minutes, but instead first run tests and then at the end properly summarize...
[08:57:10] <icinga-wm>	 RECOVERY - Thanos compact has not run on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:00:02] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:04:56] <icinga-wm>	 PROBLEM - Thanos compact has disappeared from Prometheus discovery on icinga1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview
[09:06:42] <icinga-wm>	 RECOVERY - Prometheus prometheus1003/services restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services
[09:06:46] <icinga-wm>	 RECOVERY - Thanos compact has disappeared from Prometheus discovery on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview
[09:07:14] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:08:20] <wikibugs>	 10Operations, 10netops: Telia eqiad<->codfw (IC-307235) outage ref: 01171084 - https://phabricator.wikimedia.org/T254674 (10Dzahn) "we experienced a brief service disruption in a card in Selma, AL, impacting our transmission stretch between Atlanta and Houston.  A cold reboot of the card restored service.  We...
[09:11:38] <icinga-wm>	 RECOVERY - Prometheus prometheus2004/services restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services
[09:12:53] <icinga-wm>	 RECOVERY - Prometheus prometheus2003/services restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services
[09:13:40] <icinga-wm>	 PROBLEM - More than one Thanos compact running on icinga1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:15:18] <icinga-wm>	 PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.421e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:24:24] <wikibugs>	 (CR) Volans: "Nice! Couple of minor things inline, all the rest are optional nits." (11 comments) [software/druid_exporter] - https://gerrit.wikimedia.org/r/600295 (owner: Elukey)
[09:26:12] <icinga-wm>	 RECOVERY - Thanos compact has not run on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:26:22] <icinga-wm>	 RECOVERY - More than one Thanos compact running on icinga1001 is OK: (C)1 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:29:02] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:30:18] <godog>	 ooof sorry for the alert spam
[09:30:44] <wikibugs>	 (03PS4) 10Muehlenhoff: Extend Cassandra cookbook to also cover maps [cookbooks] - 10https://gerrit.wikimedia.org/r/602318
[09:31:15] <wikibugs>	 (CR) Muehlenhoff: Extend Cassandra cookbook to also cover maps (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/602318 (owner: Muehlenhoff)
[09:32:28] <qchris>	 !log Turning on puppet on gerrit1002 again to avoid starting to lag too far behind
[09:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Extend Cassandra cookbook to also cover maps [cookbooks] - 10https://gerrit.wikimedia.org/r/602318 (owner: 10Muehlenhoff)
[09:33:18] <mutante>	 qchris: 👍
[09:33:43] <wikibugs>	 Operations, netops: Telia eqiad<->codfw (IC-307235) outage ref: 01171084 - https://phabricator.wikimedia.org/T254674 (ayounsi) Open→Resolved a:ayounsi Thanks, all back to normal.
[09:34:02] <icinga-wm>	 PROBLEM - Thanos compact has disappeared from Prometheus discovery on icinga1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview
[09:34:24] <qchris>	 :-)
[09:34:41] <wikibugs>	 (03PS5) 10Muehlenhoff: Extend Cassandra cookbook to also cover maps [cookbooks] - 10https://gerrit.wikimedia.org/r/602318
[09:38:53] <wikibugs>	 (03PS1) 10Gilles: Store Content-Disposition header in Swift [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/603386 (https://phabricator.wikimedia.org/T254557)
[09:39:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Store Content-Disposition header in Swift [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/603386 (https://phabricator.wikimedia.org/T254557) (owner: 10Gilles)
[09:40:39] <wikibugs>	 (CR) Volans: "Minor nits inline and a question/suggestion." (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/596187 (https://phabricator.wikimedia.org/T252807) (owner: Muehlenhoff)
[09:41:18] <icinga-wm>	 RECOVERY - Thanos compact has disappeared from Prometheus discovery on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview
[09:41:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:44:25] <wikibugs>	 (CR) Jbond: [C: +1] "> Patch Set 2:" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[09:45:41] <wikibugs>	 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10jbond) >>! In T254480#6198075, @ArielGlenn wrote: > @jb...
[09:46:53] <moritzm>	 !log installing gnutls28 security updates on buster (older releases not affected)
[09:46:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:54] <icinga-wm>	 PROBLEM - Thanos compact has not run on icinga1001 is CRITICAL: 4.421e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:48:06] <icinga-wm>	 PROBLEM - More than one Thanos compact running on icinga1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:48:19] <wikibugs>	 (03PS1) 10Elukey: matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740)
[09:49:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey)
[09:49:42] <icinga-wm>	 RECOVERY - Thanos compact has not run on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:49:52] <icinga-wm>	 RECOVERY - More than one Thanos compact running on icinga1001 is OK: (C)1 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:51:57] <wikibugs>	 (03PS2) 10Elukey: matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740)
[09:53:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey)
[10:00:42] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: enable bulk and slo middlewares for s3api compat [puppet] - 10https://gerrit.wikimedia.org/r/603394 (https://phabricator.wikimedia.org/T252186)
[10:00:42] <icinga-wm>	 PROBLEM - More than one Thanos compact running on icinga1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[10:08:15] <wikibugs>	 (CR) Jbond: puppet-merge: fix shellcheck issues (2 comments) [puppet] - https://gerrit.wikimedia.org/r/602738 (https://phabricator.wikimedia.org/T254480) (owner: Jbond)
[10:08:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/23053/" [puppet] - 10https://gerrit.wikimedia.org/r/603394 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi)
[10:08:43] <wikibugs>	 (03PS2) 10Filippo Giunchedi: swift: enable bulk and slo middlewares for s3api compat [puppet] - 10https://gerrit.wikimedia.org/r/603394 (https://phabricator.wikimedia.org/T252186)
[10:08:52] <wikibugs>	 (03PS3) 10Jbond: puppet-merge: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602738 (https://phabricator.wikimedia.org/T254480)
[10:10:40] <wikibugs>	 (CR) Jbond: puppet-merge: split dynamic values out of puppet-merge script (3 comments) [puppet] - https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) (owner: Jbond)
[10:10:46] <wikibugs>	 (03PS3) 10Jbond: puppet-merge: split dynamic values out of puppet-merge script [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480)
[10:12:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/602649 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond)
[10:13:57] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, and 2 others: Thumbor doesn't save Content-Disposition: inline headers to Swift for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10TheDJ) Good detective work @gilles !
[10:14:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] wmflib: add systemd.timer OnCalendar support to cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/600928 (https://phabricator.wikimedia.org/T210818) (owner: 10Cwhite)
[10:15:04] <wikibugs>	 (CR) Volans: [C: -1] "I see one issue with the way the current "cluster" is defined, see inline for the details." (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/602318 (owner: Muehlenhoff)
[10:23:54] <wikibugs>	 (03PS3) 10Elukey: matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740)
[10:25:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) (owner: 10Elukey)
[10:27:05] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, couple of optional nits inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/598021 (owner: Jbond)
[10:27:12] <wikibugs>	 (03PS4) 10Elukey: matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740)
[10:28:04] <wikibugs>	 10Operations, 10Commons: Incorrect information in category table for commonswiki - https://phabricator.wikimedia.org/T254734 (10Base)
[10:30:05] <jouncebot>	 jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T1030).
[10:30:35] <wikibugs>	 (03PS1) 10Dzahn: wikistats (cloud): update query for XML dump of Wikipedia table [puppet] - 10https://gerrit.wikimedia.org/r/603407 (https://phabricator.wikimedia.org/T254214)
[10:31:33] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] wikistats (cloud): update query for XML dump of Wikipedia table [puppet] - 10https://gerrit.wikimedia.org/r/603407 (https://phabricator.wikimedia.org/T254214) (owner: 10Dzahn)
[10:31:41] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603408 (https://phabricator.wikimedia.org/T128546)
[10:32:57] <wikibugs>	 (03PS1) 10Ayounsi: Depool codfw for routre upgrade [dns] - 10https://gerrit.wikimedia.org/r/603409 (https://phabricator.wikimedia.org/T243080)
[10:33:13] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603408 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:34:05] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603408 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:37:46] <wikibugs>	 (03PS2) 10Ayounsi: Depool codfw for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/603409 (https://phabricator.wikimedia.org/T243080)
[10:38:39] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Depool codfw for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/603409 (https://phabricator.wikimedia.org/T243080) (owner: 10Ayounsi)
[10:38:59] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Depool codfw for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/603409 (https://phabricator.wikimedia.org/T243080) (owner: 10Ayounsi)
[10:39:03] <wikibugs>	 (03PS5) 10Elukey: matomo: move archive cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740)
[10:39:17] <XioNoX>	 !log depool codfw - T243080
[10:39:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:35] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:603408| Bumping portals to master (603408)]] (duration: 01m 09s)
[10:40:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:33] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:603408| Bumping portals to master (603408)]] (duration: 00m 57s)
[10:41:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:53] <XioNoX>	 !log bump all cr1-codfw OSPF metrics - T243080
[10:41:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:24] <XioNoX>	 !log deactivate cr1-codfw transit/peering - T243080
[10:43:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Apache on RT [puppet] - 10https://gerrit.wikimedia.org/r/603412 (https://phabricator.wikimedia.org/T135991)
[10:46:14] <XioNoX>	 !log install Junos on cr1-codfw:re1 (backup) - T243080
[10:46:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:51] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Apache on peopleweb [puppet] - 10https://gerrit.wikimedia.org/r/603414 (https://phabricator.wikimedia.org/T135991)
[10:50:23] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Enable base::service_auto_restart for Apache on RT [puppet] - 10https://gerrit.wikimedia.org/r/603412 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[10:51:36] <wikibugs>	 (CR) Volans: "Looks sane, although I'm not familiar with the PDU interface to say if that part is correct or not. Minor things inline." (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: Jbond)
[10:51:41] <XioNoX>	 !log reboot cr1-codfw:re1 (backup) - T243080
[10:53:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:35] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Enable base::service_auto_restart for Apache on peopleweb [puppet] - 10https://gerrit.wikimedia.org/r/603414 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[10:56:57] <XioNoX>	 !log do cr1-codfw RE mastership switch - T243080
[10:56:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:26] <wikibugs>	 (CR) Dzahn: phabricator: convert 2 scripts created from erb to files with config files (2 comments) [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[10:59:03] <XioNoX>	 waiting for linecards to reboot
[11:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T1100).
[11:00:05] <jouncebot>	 tgr and lucaswerkmeister: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:19] <tgr>	 o/
[11:00:26] <Lucas_WMDE>	 o/
[11:00:47] <Lucas_WMDE>	 lucaswerkmeister is about to join us, his IRC client is being slow for some reason
[11:00:52] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:00:56] <Lucas_WMDE>	 tgr: do you want to deploy yourself or should I do it?
[11:01:03] <wikibugs>	 (CR) Volans: [C: -1] "Looks mostly ok to me. I can't assure all the junos commands are correct though. One error, see inline." (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/596444 (owner: Ayounsi)
[11:01:16] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:02:43] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:02:55] <tgr>	 Lucas_WMDE: I'll leave it to you, thanks!
[11:03:04] <Lucas_WMDE>	 ok!
[11:03:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:03:23] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Enable GrowthExperiments guidance everywhere behind feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601409 (https://phabricator.wikimedia.org/T253794) (owner: 10Gergő Tisza)
[11:03:30] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601409 (https://phabricator.wikimedia.org/T253794) (owner: 10Gergő Tisza)
[11:04:24] <wikibugs>	 (03Merged) 10jenkins-bot: Enable GrowthExperiments guidance everywhere behind feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601409 (https://phabricator.wikimedia.org/T253794) (owner: 10Gergő Tisza)
[11:04:25] <Lucas_WMDE>	 apparently lucaswerkmeister isn’t allowed to join this channel :S
[11:04:41] <Lucas_WMDE>	 so I guess I’ll do both halves of the deployment from this nickname, meh
[11:04:53] <Lucas_WMDE>	 (it’s a volunteer change that I’m SWAT deploying as WMDE staff)
[11:04:59] <Lucas_WMDE>	 anyways, tgr first
[11:05:00] <XioNoX>	 !log install Junos on cr1-codfw:re0 (backup) - T243080
[11:05:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:29] <Lucas_WMDE>	 tgr: change is on mwdebug1001, can you test it?
[11:05:34] <marostegui>	 !log Install events on es1 T254689
[11:05:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:38] <stashbot>	 T254689: Check that all core hosts have events installed and enabled - https://phabricator.wikimedia.org/T254689
[11:05:45] <tgr>	 the channel should be open to anyone as long as they are registered to services
[11:05:55] <tgr>	 testing
[11:06:10] <Lucas_WMDE>	 I thought that nick was registered but I guess I’m wrong
[11:07:12] <RhinosF1>	 Lucas_WMDE: it is, you just haven't identified to it
[11:07:39] <RhinosF1>	 From its client do "/msg NickServ identify yourpassword"
[11:08:06] <tgr>	 Lucas_WMDE: it works, thanks!
[11:08:36] <Lucas_WMDE>	 ok, syncing
[11:09:41] <lucaswerkmeister>	 o/
[11:09:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:601409|Enable GrowthExperiments guidance everywhere behind feature flag (T253794)]] (duration: 00m 57s)
[11:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:00] <stashbot>	 T253794: Newcomer tasks: hidden preference for guidance - https://phabricator.wikimedia.org/T253794
[11:10:00] <lucaswerkmeister>	 thanks RhinosF1, that fixed it
[11:10:13] <RhinosF1>	 :)
[11:10:28] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Remove Wikibase idBlacklist setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602981 (https://phabricator.wikimedia.org/T254686) (owner: 10Lucas Werkmeister)
[11:10:46] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602981 (https://phabricator.wikimedia.org/T254686) (owner: 10Lucas Werkmeister)
[11:11:01] <XioNoX>	 !log reboot cr1-codfw:re0 (backup) - T243080
[11:11:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:38] <wikibugs>	 (03Merged) 10jenkins-bot: Remove Wikibase idBlacklist setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/602981 (https://phabricator.wikimedia.org/T254686) (owner: 10Lucas Werkmeister)
[11:12:09] <Lucas_WMDE>	 lucaswerkmeister: change is on mwdebug1001, please test
[11:12:12] <lucaswerkmeister>	 ok
[11:12:18] <lucaswerkmeister>	 let me find an example lexeme to create
[11:12:52] <lucaswerkmeister>	 created https://www.wikidata.org/wiki/Lexeme:L301969, ID looks normal
[11:13:01] <lucaswerkmeister>	 looks like everything’s still working
[11:13:06] <Lucas_WMDE>	 ok, syncing
[11:13:13] <lucaswerkmeister>	 (I’ll fill in the rest of the lexeme later)
[11:13:34] <Lucas_WMDE>	 I guess I’ll first sync Wikibase.php, so that wmgWikibaseIdBlacklist is no longer read
[11:13:39] <Lucas_WMDE>	 and then IS.php, so it’s no longer set either
[11:13:41] <Lucas_WMDE>	 I think that’s the safe order
[11:13:46] <Lucas_WMDE>	 and better than syncing all of wmf-config/ at once
[11:15:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:602981|Remove Wikibase idBlacklist setting (T254686)]], part 1 (duration: 00m 56s)
[11:15:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:20] <stashbot>	 T254686: Rename WikibaseRepo’s idBlacklist setting - https://phabricator.wikimedia.org/T254686
[11:15:26] <XioNoX>	 !log cr1-codfw> request chassis routing-engine master switch - T243080
[11:15:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:23] <icinga-wm>	 PROBLEM - Host pfw3-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[11:16:35] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:602981|Remove Wikibase idBlacklist setting (T254686)]], part 2 (duration: 00m 56s)
[11:16:35] <volans>	 XioNoX: related? ^^^
[11:16:40] <volans>	 to the ongoing work
[11:16:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:51] <XioNoX>	 volans: shouldn't but probably
[11:16:59] <icinga-wm>	 RECOVERY - Host pfw3-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
[11:17:07] <jynus>	 was there user impact?
[11:17:11] <XioNoX>	 nop
[11:17:26] <apergos>	 paged 
[11:17:31] <marostegui>	 I have acked it
[11:18:00] <XioNoX>	 I think monitoring lost a few pings
[11:18:00] <volans>	 me too
[11:18:08] <Lucas_WMDE>	 I think that’s it for the EU SWAT
[11:18:22] <volans>	 marostegui: something's weird, you ack'ed the recovery
[11:18:30] <volans>	 shouldn't that recover? why need to be acked?
[11:18:33] <marostegui>	 oh
[11:18:33] <_joe_>	 so victorops didn't see the recovery as the recovery
[11:18:34] <Lucas_WMDE>	 I hope those codfw messages aren’t SWAT-related?
[11:18:38] <volans>	 yeah
[11:18:39] <volans>	 weird
[11:18:39] <_joe_>	 Lucas_WMDE: no
[11:18:41] <marostegui>	 weird
[11:18:42] <jynus>	 there is this: https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&from=1591604312681&to=1591615112681
[11:18:55] <Lucas_WMDE>	 !log EU SWAT done
[11:18:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:58] <_joe_>	 please someone report this to the observability team
[11:19:07] <volans>	 I'll take care of it
[11:19:20] <icinga-wm>	 PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 64, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:19:20] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:19:36] <wikibugs>	 (03PS1) 10Marostegui: db1141: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/603419 (https://phabricator.wikimedia.org/T252512)
[11:19:48] <XioNoX>	 I'm staying focused on my upgrade
[11:19:59] <XioNoX>	 but will look at the pfw3-codfw afterwards
[11:20:32] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1141: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/603419 (https://phabricator.wikimedia.org/T252512) (owner: 10Marostegui)
[11:21:08] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:22:58] <icinga-wm>	 RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 66, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:23:26] <wikibugs>	 Operations, Traffic: ats-backend throttles connections under heavy load - https://phabricator.wikimedia.org/T254714 (jbond) p:Triage→Medium
[11:23:50] <volans>	 XioNoX: ack, it seems that the cr2/pfw3 connection was lost and I bet the icinga check was going over that one.
[11:24:41] <XioNoX>	 yeah it shouldn't though as I was only working on cr1
[11:25:17] <volans>	 I've sent an email to observability to investigate the VO side
[11:25:29] <wikibugs>	 Operations, Phabricator, Security-Team, Traffic: Accessing Phabricator from Tor - https://phabricator.wikimedia.org/T254568 (jbond) p:Triage→Medium
[11:26:16] <wikibugs>	 Operations, Security, User-MoritzMuehlenhoff, User-jbond: Ferm sometimes (rarely) fails to reload - https://phabricator.wikimedia.org/T254477 (jbond) p:Triage→Medium
[11:27:27] <wikibugs>	 (CR) Muehlenhoff: Extend Cassandra cookbook to also cover maps (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/602318 (owner: Muehlenhoff)
[11:28:08] <wikibugs>	 (03PS3) 10Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - 10https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480)
[11:28:21] <XioNoX>	 !log cr1-codfw add graceful-switchover - T243080
[11:28:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:39] <wikibugs>	 (CR) Dzahn: phabricator: convert 2 scripts created from erb to files with config files (2 comments) [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[11:29:23] <XioNoX>	 !log cr1-codfw add graceful-restart - T243080
[11:29:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:43] <wikibugs>	 Operations, Security, User-MoritzMuehlenhoff, User-jbond: Ferm sometimes (rarely) fails to reload - https://phabricator.wikimedia.org/T254477 (jbond) I think the problem here is that when `/usr/sbin/ferm -nl --domain ip /etc/ferm/ferm.conf` is run it sometimes fails to resolve DNS hosts.  We coul...
[11:29:45] <wikibugs>	 (PS6) Muehlenhoff: Drop maps from supported clusters [cookbooks] - https://gerrit.wikimedia.org/r/602318
[11:30:26] <XioNoX>	 !log cr1-codfw re-enable transit/peering - T243080
[11:30:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:13] <wikibugs>	 (PS4) Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480)
[11:32:04] <XioNoX>	 !log cr1-codfw set OSPF metrics back to normal - T243080
[11:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:35] <wikibugs>	 Operations, Security, User-MoritzMuehlenhoff, User-jbond: Ferm sometimes (rarely) fails to reload - https://phabricator.wikimedia.org/T254477 (MoritzMuehlenhoff) Agreed, it's definitely related to failing DNS lookups, this happened more often until https://github.com/wikimedia/puppet/commit/5e8e6...
[11:33:37] <wikibugs>	 (CR) Hnowlan: [C: +1] beta: Allow using docker volumes [puppet] - https://gerrit.wikimedia.org/r/601717 (https://phabricator.wikimedia.org/T251176) (owner: Alexandros Kosiaris)
[11:33:49] <wikibugs>	 Operations, Security, User-MoritzMuehlenhoff, User-jbond: Ferm sometimes (rarely) fails to reload - https://phabricator.wikimedia.org/T254477 (jbond) might be better to resolve the DNS on the puppet master and only have IP's in the ferm config (no idea how much effort that would be though)
[11:35:52] <wikibugs>	 (PS5) Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480)
[11:36:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] phabricator: convert 2 scripts created from erb to files with config files [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[11:38:01] <XioNoX>	 !log fail vrrp master from cr2 to cr1 - T243080
[11:38:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:05] <XioNoX>	 !log deactivate cr2-codfw transit/peering - T243080
[11:39:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:34] <wikibugs>	 (PS6) Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480)
[11:41:41] <XioNoX>	 !log de-pref cr2-codfw OSPF - T243080
[11:41:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:16] <moritzm>	 !log rolling restart of Apache on Kibana/7 host to pick up Gnu TLS security update
[11:43:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:30] <moritzm>	 !log restarting slapd on ldap-corp* for Gnu TLS security update
[11:45:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:32] <wikibugs>	 (CR) Dzahn: "> Patch Set 2: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[11:47:28] <wikibugs>	 (PS7) Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480)
[11:49:32] <XioNoX>	 !log reboot cr2-codfw:re1 (backup) - T243080
[11:49:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:07] <marostegui>	 !log Deploy schema change on s3 - T251188
[11:53:23] <moritzm>	 !log restarting dnsdist on malmok
[11:53:44] <XioNoX>	 !log cr2-codfw> request chassis routing-engine master switch - T243080
[11:53:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:55] <stashbot>	 T251188: ipb_address_unique has an extra column in production but not in the code (WAS: ipb_address_unique has an extra column in the code but not in production) - https://phabricator.wikimedia.org/T251188
[11:53:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:47] <XioNoX>	 waiting for the linecards to boot up
[11:56:03] <wikibugs>	 (PS8) Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480)
[11:56:23] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:56:33] <wikibugs>	 (PS9) Dzahn: phabricator: convert 2 scripts created from erb to files with config files [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480)
[11:57:17] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:57:29] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:57:38] <wikibugs>	 (CR) Dzahn: [C: +1] "works now: https://puppet-compiler.wmflabs.org/compiler1002/23059/phab1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[11:58:33] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:58:47] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:58:49] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:01:32] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: convert 2 scripts created from erb to files with config files [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[12:04:40] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 88 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:05:49] <wikibugs>	 (PS7) Jbond: sre.pdus.rotate-password: split generic functions out to __init__.py [cookbooks] - https://gerrit.wikimedia.org/r/598021
[12:05:50] <XioNoX>	 !log reboot cr2-codfw:re0 (backup) - T243080
[12:05:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:58] <XioNoX>	 !log cr2-codfw> request chassis routing-engine master switch - T243080
[12:09:59] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 58 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:10:00] <XioNoX>	 last one!
[12:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:15] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:13:17] * kart_ is updating cxserver. 
[12:14:19] <kart_>	 XioNoX: Is it OK to deploy?
[12:14:20] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:14:48] <XioNoX>	 kart_: I'm done in 30s
[12:14:55] <kart_>	 OK!
[12:15:04] <XioNoX>	 checking that router came back as expected
[12:15:41] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:16:05] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:16:55] <XioNoX>	 kart_: yep all good!
[12:17:03] <XioNoX>	 thx for asking!
[12:17:11] <kart_>	 :)
[12:17:14] <wikibugs>	 (PS1) Kormat: Add native mysql spicerack moodule. [software/spicerack] - https://gerrit.wikimedia.org/r/603434
[12:17:23] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2020-06-08-045500-production [deployment-charts] - https://gerrit.wikimedia.org/r/603141 (https://phabricator.wikimedia.org/T246319) (owner: KartikMistry)
[12:17:55] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2020-06-08-045500-production [deployment-charts] - https://gerrit.wikimedia.org/r/603141 (https://phabricator.wikimedia.org/T246319) (owner: KartikMistry)
[12:18:01] <marostegui>	 !log Compress InnoDB on db2094:3311 T254462
[12:18:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:05] <stashbot>	 T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462
[12:18:22] <XioNoX>	 !log rollback cr2-codfw vrrp/ospf/bgp changes - T243080
[12:18:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:29] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 117 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:20:50] <kart_>	 akosiaris: b3b13a6ae5ac4da5cfcafb65e28ca7b03b2a3069 seems there in deployment-chart, seems undeployed?
[12:21:14] <kart_>	 My mistake. Sorry. but, something for sure.
[12:22:34] <wikibugs>	 (PS1) Ayounsi: Revert "Depool codfw for routers upgrade" [dns] - https://gerrit.wikimedia.org/r/603436
[12:22:38] <kart_>	 akosiaris: cxserver-0.0.19 - is it OK to deploy?
[12:22:58] <moritzm>	 kart_: it's a bank holiday in Greece today
[12:23:11] <wikibugs>	 (CR) Ayounsi: [C: +2] Revert "Depool codfw for routers upgrade" [dns] - https://gerrit.wikimedia.org/r/603436 (owner: Ayounsi)
[12:23:20] <kart_>	 Ouch. OK. I need to revert my merge then.
[12:23:24] <wikibugs>	 (PS1) JMeybohm: lvs::configuration: add termbox-https [puppet] - https://gerrit.wikimedia.org/r/603437 (https://phabricator.wikimedia.org/T254581)
[12:23:27] <XioNoX>	 !log repool codfw - T243080
[12:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:43] <wikibugs>	 (PS1) KartikMistry: Revert "Update cxserver to 2020-06-08-045500-production" [deployment-charts] - https://gerrit.wikimedia.org/r/603438
[12:24:18] <wikibugs>	 (PS17) Kormat: install_server: Allow reuse of partitions during reimage. [puppet] - https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027)
[12:25:11] <wikibugs>	 (CR) KartikMistry: [C: +2] "Need to check unmerged changes in deployment-charts repository before deploy." [deployment-charts] - https://gerrit.wikimedia.org/r/603438 (owner: KartikMistry)
[12:25:21] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:25:47] <wikibugs>	 (Merged) jenkins-bot: Revert "Update cxserver to 2020-06-08-045500-production" [deployment-charts] - https://gerrit.wikimedia.org/r/603438 (owner: KartikMistry)
[12:27:58] <wikibugs>	 (CR) Kormat: "Dear reviewers," [puppet] - https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: Kormat)
[12:29:24] <wikibugs>	 Operations, Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (Dzahn) >>! In T252932#6174068, @JGulingan wrote: > Just to clarify, IT does not manage donate@. Can you clarify if this points to Fundraising's zendesk email?  Yea, so far it's managed by SRE.  That's why...
[12:30:01] <wikibugs>	 Operations, Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (Dzahn) a:JGulingan
[12:33:07] <wikibugs>	 (PS1) Cmjohnson: Add relforge100[34] to netboot cfg and dhcpd file [puppet] - https://gerrit.wikimedia.org/r/603440 (https://phabricator.wikimedia.org/T241791)
[12:33:56] <wikibugs>	 Operations, ops-eqiad, Discovery-Search (Current work), Patch-For-Review: (Need by: TBD) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (Cmjohnson)
[12:34:23] <wikibugs>	 (CR) Cmjohnson: [C: +2] Add relforge100[34] to netboot cfg and dhcpd file [puppet] - https://gerrit.wikimedia.org/r/603440 (https://phabricator.wikimedia.org/T241791) (owner: Cmjohnson)
[12:34:26] <wikibugs>	 (PS1) Filippo Giunchedi: Fix Thanos compact alert threshold [puppet] - https://gerrit.wikimedia.org/r/603441 (https://phabricator.wikimedia.org/T252186)
[12:35:51] <wikibugs>	 Operations, ops-eqiad, Discovery-Search (Current work), Patch-For-Review: (Need by: TBD) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (Cmjohnson)
[12:36:25] <wikibugs>	 (CR) Dzahn: "tested both scripts.. noticed one does not work due to sender address being gone.. fixing but unrelated to this change." [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[12:39:32] <wikibugs>	 (PS1) Dzahn: phabricator: change sender address of community_metrics mail [puppet] - https://gerrit.wikimedia.org/r/603445
[12:39:34] <wikibugs>	 (PS2) Filippo Giunchedi: Fix Thanos compact alert threshold [puppet] - https://gerrit.wikimedia.org/r/603441 (https://phabricator.wikimedia.org/T252186)
[12:39:47] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Fix Thanos compact alert threshold [puppet] - https://gerrit.wikimedia.org/r/603441 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[12:41:49] <wikibugs>	 (PS1) Vgutierrez: ATS: Add support on tls.lua for http requests [puppet] - https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235)
[12:44:52] <wikibugs>	 (PS35) Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890)
[12:45:44] <wikibugs>	 (PS4) RhinosF1: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/603365 (https://phabricator.wikimedia.org/T254012)
[12:47:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: Jbond)
[12:47:41] <wikibugs>	 (PS5) RhinosF1: Add namespace alias and project namespace for hiwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/603365 (https://phabricator.wikimedia.org/T254012)
[12:52:01] <wikibugs>	 (CR) Volans: [C: +1] "LGTM if we want to go towards this approach. I'll to the people more involved in the related services to decide the direction to go." [cookbooks] - https://gerrit.wikimedia.org/r/602318 (owner: Muehlenhoff)
[12:52:51] <wikibugs>	 (PS1) RhinosF1: Enable WikiLove on slwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/603453
[12:53:46] <XioNoX>	 volans: I think the pfw3 issue was a race condition
[12:54:16] <wikibugs>	 (CR) Volans: [C: +1] "Small nit on the help message, looks good otherwise." (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/598021 (owner: Jbond)
[12:54:30] <volans>	 XioNoX: on the check side?
[12:54:32] <wikibugs>	 (PS2) RhinosF1: Enable WikiLove on slwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/603453 (https://phabricator.wikimedia.org/T254706)
[12:55:24] <XioNoX>	 the pfw facing port on cr1 probably came up a tad before the other ports, causing the pfw to try to route through it (because of MEDs)
[12:55:37] <wikibugs>	 (CR) Elukey: [C: +2] matomo: move archive cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) (owner: Elukey)
[12:55:52] <XioNoX>	 and because no other port up, traffic got blackholed
[12:56:01] <jbond42>	 godog: volans: 
[12:56:10] <jbond42>	 sorry been so long since i did that :)
[12:56:26] <volans>	 jbond42: 2 birds with one stone! :D
[12:56:28] <volans>	 XioNoX: ack
[12:56:52] <wikibugs>	 (PS36) Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890)
[12:58:44] <wikibugs>	 (CR) jerkins-bot: [V: -1] cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: Jbond)
[13:01:48] <wikibugs>	 (CR) Muehlenhoff: matomo: move archive cron to systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) (owner: Elukey)
[13:03:14] <wikibugs>	 (PS1) Filippo Giunchedi: profile: allow NaN for Thanos compact/query errors [puppet] - https://gerrit.wikimedia.org/r/603457 (https://phabricator.wikimedia.org/T252186)
[13:03:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile: allow NaN for Thanos compact/query errors [puppet] - https://gerrit.wikimedia.org/r/603457 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[13:04:59] <wikibugs>	 (CR) Elukey: [C: +2] matomo: move archive cron to systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/603391 (https://phabricator.wikimedia.org/T252740) (owner: Elukey)
[13:05:14] <wikibugs>	 (PS2) Filippo Giunchedi: profile: allow NaN for Thanos compact/query errors [puppet] - https://gerrit.wikimedia.org/r/603457 (https://phabricator.wikimedia.org/T252186)
[13:05:16] <wikibugs>	 (PS6) Filippo Giunchedi: prometheus: merge ops instance role into profile [puppet] - https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186)
[13:05:18] <wikibugs>	 (PS3) Filippo Giunchedi: prometheus: enable Thanos upload for k8s [puppet] - https://gerrit.wikimedia.org/r/602715 (https://phabricator.wikimedia.org/T252186)
[13:05:21] <wikibugs>	 (PS3) Filippo Giunchedi: prometheus: enable Thanos upload for ops in esams [puppet] - https://gerrit.wikimedia.org/r/602717 (https://phabricator.wikimedia.org/T252186)
[13:07:35] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] profile: allow NaN for Thanos compact/query errors [puppet] - https://gerrit.wikimedia.org/r/603457 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[13:09:32] <wikibugs>	 (PS2) Kormat: Add native mysql spicerack moodule. [software/spicerack] - https://gerrit.wikimedia.org/r/603434
[13:10:46] <wikibugs>	 (PS1) Elukey: profile::piwik::instance: fix archiver's settings [puppet] - https://gerrit.wikimedia.org/r/603460 (https://phabricator.wikimedia.org/T252740)
[13:13:38] <icinga-wm>	 RECOVERY - More than one Thanos compact running on icinga1001 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[13:13:43] <wikibugs>	 (CR) Elukey: [C: +2] profile::piwik::instance: fix archiver's settings [puppet] - https://gerrit.wikimedia.org/r/603460 (https://phabricator.wikimedia.org/T252740) (owner: Elukey)
[13:15:10] <godog>	 I'll be roll-restarting prometheus 'ops' instance, no impact expected
[13:18:24] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: merge ops instance role into profile [puppet] - https://gerrit.wikimedia.org/r/602409 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[13:21:36] <godog>	 actually no that was a lie, no restart needed
[13:30:26] <wikibugs>	 (CR) Volans: "Thanks for the patch! It's nice to see progress on this! I did a first pass, let's chat offline about the details and potential future exp" (7 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/603434 (owner: Kormat)
[13:32:50] <wikibugs>	 (PS37) Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890)
[13:36:21] <wikibugs>	 (PS8) Jbond: sre.pdus.rotate-password: split generic functions out to __init__.py [cookbooks] - https://gerrit.wikimedia.org/r/598021
[13:36:29] <wikibugs>	 (PS38) Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890)
[13:36:40] <icinga-wm>	 PROBLEM - PHP opcache health on mw2241 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[13:36:54] <wikibugs>	 Operations, Analytics, Analytics-Kanban, EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (ema) Open→Declined >>! In T242767#6199410, @MrJaroslavik wrote: > Hey, can be fixed this problem?...
[13:41:35] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers
[13:41:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:16] <elukey>	 jumbo cluster --^
[13:45:00] <wikibugs>	 (CR) Jcrespo: "Please don't discuss fully offline, there are things that Riccardo won't know about our MySQL setup that I could help with." (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/603434 (owner: Kormat)
[13:45:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw2241 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[13:46:57] <wikibugs>	 (PS1) Elukey: sre.kafka.roll-restart-brokers: improve documentation readability [cookbooks] - https://gerrit.wikimedia.org/r/603473
[13:47:43] <wikibugs>	 Operations, Patch-For-Review, SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (CDanis) Open→Resolved
[13:49:00] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[13:49:52] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[13:50:44] <wikibugs>	 (CR) Jbond: [C: +1] "thanks :)" [puppet] - https://gerrit.wikimedia.org/r/603364 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[13:50:57] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update mitigations for T250887 (duration: 00m 57s)
[13:50:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:02] <wikibugs>	 Operations, Patch-For-Review, SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (Aklapper) @CDanis: Please feel free to {nav icon=anchor,name=Edit Related Tasks... > Close As Duplicate} in the upper right corner. Thanks!
[13:53:11] <wikibugs>	 (PS2) Vgutierrez: ATS: Add support on tls.lua for http requests [puppet] - https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235)
[13:53:12] <wikibugs>	 Operations, Patch-For-Review, SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (Aklapper)
[13:54:49] <wikibugs>	 (PS4) Jbond: puppet-merge: fix shellcheck issues [puppet] - https://gerrit.wikimedia.org/r/602738 (https://phabricator.wikimedia.org/T254480)
[13:54:53] <wikibugs>	 Operations, Patch-For-Review, SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (CDanis) I don't think they're strictly speaking duplicates; this task was for tracking the incident itself and writing the document in the first place; the ot...
[13:56:59] <wikibugs>	 (CR) Jbond: [C: +2] puppet-merge: split dynamic values out of puppet-merge script [puppet] - https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) (owner: Jbond)
[13:57:02] <wikibugs>	 (CR) Jbond: [C: +2] puppet-merge: fix shellcheck issues [puppet] - https://gerrit.wikimedia.org/r/602738 (https://phabricator.wikimedia.org/T254480) (owner: Jbond)
[13:58:34] <wikibugs>	 (PS1) Vgutierrez: mtail: Adjust ATS TTFB buckets [puppet] - https://gerrit.wikimedia.org/r/603475
[13:58:46] <wikibugs>	 Operations, Analytics, Analytics-Kanban, EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (Ottomata) Hm, EventStreams uses the Server Sent Events for this very reason.  I don't think anyone is exp...
[13:58:49] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[13:58:50] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:58:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:57] <wikibugs>	 (PS1) Jbond: whiespace CR to check puppet-merge [puppet] - https://gerrit.wikimedia.org/r/603476
[14:00:54] <wikibugs>	 (CR) Jbond: [C: +2] whiespace CR to check puppet-merge [puppet] - https://gerrit.wikimedia.org/r/603476 (owner: Jbond)
[14:00:56] <jbond42>	 !log updating puppet-merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/602738/4
[14:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:00] <wikibugs>	 (CR) Volans: [C: +1] "documentation only" [cookbooks] - https://gerrit.wikimedia.org/r/603473 (owner: Elukey)
[14:03:25] <wikibugs>	 (CR) Elukey: [C: +2] sre.kafka.roll-restart-brokers: improve documentation readability [cookbooks] - https://gerrit.wikimedia.org/r/603473 (owner: Elukey)
[14:05:48] <wikibugs>	 Operations, Patch-For-Review, SRE-OnFire-Incident-Docs: Document 2020-02-04 kartotherian incident - https://phabricator.wikimedia.org/T244278 (Aklapper) duplicate→Resolved Ah. Sorry!
[14:09:27] <wikibugs>	 (PS1) Jbond: puppet-merge: fix syntax [puppet] - https://gerrit.wikimedia.org/r/603479
[14:10:24] <wikibugs>	 (CR) Jbond: [C: +2] puppet-merge: fix syntax [puppet] - https://gerrit.wikimedia.org/r/603479 (owner: Jbond)
[14:10:40] <wikibugs>	 (PS1) Jbond: Revert "whiespace CR to check puppet-merge" [puppet] - https://gerrit.wikimedia.org/r/603480
[14:11:18] <wikibugs>	 (CR) Ema: [C: +1] mtail: Adjust ATS TTFB buckets [puppet] - https://gerrit.wikimedia.org/r/603475 (owner: Vgutierrez)
[14:11:52] <wikibugs>	 (CR) Jbond: [C: +2] Revert "whiespace CR to check puppet-merge" [puppet] - https://gerrit.wikimedia.org/r/603480 (owner: Jbond)
[14:13:16] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/598021 (owner: Jbond)
[14:14:06] <wikibugs>	 (PS1) Ladsgroup: Wrap WAN-cached PropertyInfoLookup with an APCu cache [extensions/Wikibase] (wmf/1.35.0-wmf.35) - https://gerrit.wikimedia.org/r/603482 (https://phabricator.wikimedia.org/T254536)
[14:14:15] <wikibugs>	 (CR) Hnowlan: [C: +2] changeprop: simplify config writing. make beta config write puppet-friendly YAML. [deployment-charts] - https://gerrit.wikimedia.org/r/598026 (https://phabricator.wikimedia.org/T251176) (owner: Hnowlan)
[14:14:57] <wikibugs>	 (Merged) jenkins-bot: changeprop: simplify config writing. make beta config write puppet-friendly YAML. [deployment-charts] - https://gerrit.wikimedia.org/r/598026 (https://phabricator.wikimedia.org/T251176) (owner: Hnowlan)
[14:15:17] <wikibugs>	 (PS1) Jbond: puppetmasters::scritps: join arrays [puppet] - https://gerrit.wikimedia.org/r/603483
[14:18:38] <wikibugs>	 (CR) CDanis: [C: +1] puppetmasters::scritps: join arrays [puppet] - https://gerrit.wikimedia.org/r/603483 (owner: Jbond)
[14:18:49] <cdanis>	 jbond42: lmk if you need a hand
[14:19:24] <wikibugs>	 (CR) Jbond: [C: +2] puppetmasters::scritps: join arrays [puppet] - https://gerrit.wikimedia.org/r/603483 (owner: Jbond)
[14:19:46] <jbond42>	 cdanis: thanks i think i got it now, just a couple of silly issues that slipped through
[14:20:41] <cdanis>	 that's the usual with that script, it'd be interesting to have a 'proper' testing environment for it, but that also sounds like a lot of work
[14:22:14] <jbond42>	 yes and yes :)
[14:22:17] <wikibugs>	 (PS2) Vgutierrez: mtail: Adjust ATS TTFB buckets [puppet] - https://gerrit.wikimedia.org/r/603475 (https://phabricator.wikimedia.org/T254714)
[14:22:18] <jbond42>	 godog: volans: 
[14:22:28] <jbond42>	 bad day for it apparently :(
[14:23:40] <jbond42>	 puppet-merge all working again
[14:23:58] <wikibugs>	 (CR) Vgutierrez: [C: +2] mtail: Adjust ATS TTFB buckets [puppet] - https://gerrit.wikimedia.org/r/603475 (https://phabricator.wikimedia.org/T254714) (owner: Vgutierrez)
[14:25:15] <wikibugs>	 (PS1) Jbond: Revert "Revert "whiespace CR to check puppet-merge"" [puppet] - https://gerrit.wikimedia.org/r/603485
[14:26:06] <wikibugs>	 (CR) Volans: [C: -1] "one typo and one suggestion inline." (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: Jbond)
[14:26:09] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "It looks like role::osm::common isn't used in production for kartotherian etc, but rather role::maps (either ::master or ::slave) via profi" [puppet] - https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: Mholloway)
[14:26:43] <godog>	 jbond42: haha!
[14:27:37] <wikibugs>	 (PS1) CDanis: fix puppet merge typos [puppet] - https://gerrit.wikimedia.org/r/603487
[14:28:03] <wikibugs>	 (CR) Jbond: [C: +1] "thanks" [puppet] - https://gerrit.wikimedia.org/r/603487 (owner: CDanis)
[14:28:20] <wikibugs>	 (CR) CDanis: [C: +2] fix puppet merge typos [puppet] - https://gerrit.wikimedia.org/r/603487 (owner: CDanis)
[14:28:23] <wikibugs>	 (CR) Jbond: [C: +2] fix puppet merge typos [puppet] - https://gerrit.wikimedia.org/r/603487 (owner: CDanis)
[14:28:56] <Amir1>	 Deploying this now: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/603482
[14:29:07] <wikibugs>	 (CR) Ladsgroup: [C: +2] Wrap WAN-cached PropertyInfoLookup with an APCu cache [extensions/Wikibase] (wmf/1.35.0-wmf.35) - https://gerrit.wikimedia.org/r/603482 (https://phabricator.wikimedia.org/T254536) (owner: Ladsgroup)
[14:30:08] <wikibugs>	 (CR) Jbond: [C: +2] Revert "Revert "whiespace CR to check puppet-merge"" [puppet] - https://gerrit.wikimedia.org/r/603485 (owner: Jbond)
[14:33:25] <wikibugs>	 (CR) Filippo Giunchedi: "Minor comments inline, LGTM overall" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[14:33:58] <wikibugs>	 (CR) CDanis: [C: +2] Systemd::Servicename: make it reflect reality e.g. php7.2-fpm [puppet] - https://gerrit.wikimedia.org/r/601460 (owner: CDanis)
[14:34:30] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] profile: add loki_event filter script [puppet] - https://gerrit.wikimedia.org/r/602729 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[14:35:49] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "see comments inline." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/603437 (https://phabricator.wikimedia.org/T254581) (owner: JMeybohm)
[14:37:14] <wikibugs>	 (PS1) Dzahn: httpbb: convert an .erb.sh script to inline content [puppet] - https://gerrit.wikimedia.org/r/603490 (https://phabricator.wikimedia.org/T254480)
[14:37:51] <wikibugs>	 (CR) Jbond: "Thanks, updated. I'll also perform some more testing tomorrow before this gets merged" (8 comments) [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890) (owner: Jbond)
[14:38:45] <wikibugs>	 (PS39) Jbond: cookbooks sre.hosts.rotate-pdu-password: reset SNMP [cookbooks] - https://gerrit.wikimedia.org/r/594445 (https://phabricator.wikimedia.org/T246890)
[14:41:52] <moritzm>	 !log upgrading mw API servers in codfw to PHP 7.2.31
[14:41:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:02] <wikibugs>	 (CR) Hnowlan: [C: +2] changeprop-jobqueue: enable partitioned jobs [deployment-charts] - https://gerrit.wikimedia.org/r/602430 (https://phabricator.wikimedia.org/T220399) (owner: Hnowlan)
[14:45:20] <wikibugs>	 (PS3) Kormat: Add native mysql spicerack module. [software/spicerack] - https://gerrit.wikimedia.org/r/603434
[14:46:30] <wikibugs>	 (PS1) Dzahn: icinga: convert sync_icinga_state.sh.erb to file with config [puppet] - https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480)
[14:47:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] icinga: convert sync_icinga_state.sh.erb to file with config [puppet] - https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[14:47:46] <cdanis>	 !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕚☕ sudo cumin A:mw-canary 'disable-puppet "cdanis deploying I25ab44c1 T252605"'
[14:47:48] <wikibugs>	 (CR) CDanis: [C: +2] textfile exporter for php-fpm worker pool status [puppet] - https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) (owner: CDanis)
[14:47:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:50] <stashbot>	 T252605: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605
[14:48:02] <wikibugs>	 (PS2) Dzahn: icinga: convert sync_icinga_state.sh.erb to file with config [puppet] - https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480)
[14:48:09] <wikibugs>	 (CR) Kormat: Add native mysql spicerack module. (6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/603434 (owner: Kormat)
[14:48:54] <papaul>	 !log powering down ms-be2016 for BBU replacement 
[14:48:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:43] <wikibugs>	 (PS2) JMeybohm: lvs::configuration: add termbox-https [puppet] - https://gerrit.wikimedia.org/r/603437 (https://phabricator.wikimedia.org/T254581)
[14:51:20] <icinga-wm>	 PROBLEM - Host ms-be2016 is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:31] <wikibugs>	 (PS5) EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - https://gerrit.wikimedia.org/r/602171
[14:52:15] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
[14:52:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:08] <cdanis>	 !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕚☕ sudo cumin A:mw-canary 'enable-puppet "cdanis deploying I25ab44c1 T252605"' 
[14:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:12] <stashbot>	 T252605: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605
[14:54:05] <wikibugs>	 Operations, DC-Ops, SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (jbond) @Jclark-ctr ping: Are you able to respond to the comments and questions from Daniel above, thanks
[14:54:38] <wikibugs>	 (CR) Dzahn: [C: -1] "Could not find any files from role/icinga/sync_icinga_state.sh" [puppet] - https://gerrit.wikimedia.org/r/603492 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[14:55:17] <wikibugs>	 (PS2) Hnowlan: changeprop-jobqueue: enable partitioned jobs [deployment-charts] - https://gerrit.wikimedia.org/r/602430 (https://phabricator.wikimedia.org/T220399)
[14:57:55] <wikibugs>	 (CR) Hnowlan: [V: +2 C: +2] changeprop-jobqueue: enable partitioned jobs [deployment-charts] - https://gerrit.wikimedia.org/r/602430 (https://phabricator.wikimedia.org/T220399) (owner: Hnowlan)
[14:58:13] <wikibugs>	 Operations, DC-Ops, SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (Jclark-ctr) @jbond   have recently had issues with computer have reached out to IT will be reimaged
[14:58:23] <wikibugs>	 (Merged) jenkins-bot: changeprop-jobqueue: enable partitioned jobs [deployment-charts] - https://gerrit.wikimedia.org/r/602430 (https://phabricator.wikimedia.org/T220399) (owner: Hnowlan)
[15:01:42] <wikibugs>	 Operations, observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (herron) Open→Resolved a:herron
[15:02:54] <wikibugs>	 (Merged) jenkins-bot: Wrap WAN-cached PropertyInfoLookup with an APCu cache [extensions/Wikibase] (wmf/1.35.0-wmf.35) - https://gerrit.wikimedia.org/r/603482 (https://phabricator.wikimedia.org/T254536) (owner: Ladsgroup)
[15:05:52] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[15:05:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:57] <Amir1>	 Deployed on mwdebug1001 and works fine
[15:09:31] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.35/extensions/Wikibase/lib/includes/Store/CachingPropertyInfoLookup.php: Wrap WAN-cached PropertyInfoLookup with an APCu cache, Part I out of III (T254536) (duration: 00m 59s)
[15:09:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:35] <stashbot>	 T254536: CacheAwarePropertyInfoStore performs 4000 Memc ops/s (APC not working?) - https://phabricator.wikimedia.org/T254536
[15:10:14] <icinga-wm>	 RECOVERY - Host ms-be2016 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[15:10:54] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.35/extensions/Wikibase/repo/includes/Store/Sql/SqlStore.php: Wrap WAN-cached PropertyInfoLookup with an APCu cache, Part II out of III (T254536) (duration: 00m 57s)
[15:10:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:02] <wikibugs>	 Operations, ops-codfw: BBU faulty on ms-be2016 - https://phabricator.wikimedia.org/T252851 (Papaul) Open→Resolved BBU replacement complete
[15:11:15] <wikibugs>	 (PS6) EBernhardson: Revert "Revert "Role for SDoC WDQS"" [puppet] - https://gerrit.wikimedia.org/r/602171
[15:11:50] <icinga-wm>	 PROBLEM - PHP opcache health on mw2199 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:12:35] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.35/extensions/Wikibase/client/includes/Store/Sql/DirectSqlStore.php: Wrap WAN-cached PropertyInfoLookup with an APCu cache, Part III out of III (T254536) (duration: 00m 57s)
[15:12:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:29] <Amir1>	 _joe_: It's deployed now
[15:13:43] <wikibugs>	 (PS2) MSantos: maps: profile::rsyslog::udp_localhost_compat [puppet] - https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: Mholloway)
[15:14:00] <wikibugs>	 Operations, observability, Patch-For-Review: StatsD Exporter drops relayed metrics - https://phabricator.wikimedia.org/T239833 (colewhite) Open→Declined Still a problem, but probably not big enough to warrant the effort.
[15:15:18] <wikibugs>	 Operations, observability: mtail rc35 stops incrementing atsmtail counters - https://phabricator.wikimedia.org/T254192 (colewhite) Open→Resolved This issue hasn't resurfaced since disabling fsnotify.  Moving forward with the upgrade.
[15:16:12] <icinga-wm>	 RECOVERY - HP RAID on ms-be2016 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[15:19:10] <icinga-wm>	 PROBLEM - PHP opcache health on mw2139 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:19:26] <icinga-wm>	 PROBLEM - PHP opcache health on mw2138 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:20:50] <icinga-wm>	 PROBLEM - PHP opcache health on mw2136 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:20:53] <_joe_>	 Amir1: we had a drop in memcached requests around the time of the deploy :))
[15:20:57] <Amir1>	 https://grafana.wikimedia.org/d/000000316/memcache?panelId=21&fullscreen&orgId=1&from=1591627189995&to=1591629557781
[15:20:58] <icinga-wm>	 PROBLEM - PHP opcache health on mw2135 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:21:08] <wikibugs>	 (CR) Mholloway: maps: profile::rsyslog::udp_localhost_compat (1 comment) [puppet] - https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: Mholloway)
[15:21:20] <icinga-wm>	 PROBLEM - PHP opcache health on mw2137 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:21:23] <Amir1>	 is this caused by my patch ^
[15:21:50] <Amir1>	 this is codfw
[15:22:20] <Amir1>	 _joe_: \o/ my estimation is reduction of 25K reqs/s = 5% total requests 
[15:22:26] <icinga-wm>	 PROBLEM - PHP opcache health on mw2144 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:22:38] <icinga-wm>	 RECOVERY - PHP opcache health on mw2199 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:22:43] <wikibugs>	 (PS1) CDanis: expand phpfpm status text exporter to all appservers [puppet] - https://gerrit.wikimedia.org/r/603511 (https://phabricator.wikimedia.org/T252605)
[15:22:46] <icinga-wm>	 PROBLEM - PHP opcache health on mw2140 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:22:47] <_joe_>	 don't worry about codfw, sigh
[15:22:48] <icinga-wm>	 PROBLEM - PHP opcache health on mw2147 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:23:02] <icinga-wm>	 PROBLEM - PHP opcache health on mw2146 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:23:06] <icinga-wm>	 PROBLEM - PHP opcache health on mw2142 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:23:10] <icinga-wm>	 PROBLEM - PHP opcache health on mw2145 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:23:22] <icinga-wm>	 PROBLEM - PHP opcache health on mw2143 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:23:49] <wikibugs>	 (PS1) Ppchelko: Disable HTCP purges for test.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781)
[15:24:41] <logmsgbot>	 !log hnowlan@deploy1001 Started deploy [cpjobqueue/deploy@07d8c32]: Disabling jobs migrated to k8s
[15:24:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] Disable HTCP purges for test.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[15:25:25] <Amir1>	 afk for lunch
[15:26:35] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (Mholloway) This happened again over the weekend.  I've restarted it.
[15:27:00] <wikibugs>	 (PS2) Ppchelko: Disable HTCP purges for test.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781)
[15:27:03] <_joe_>	 Amir1: didn't know you moved to the US :D
[15:27:09] <_joe_>	 or well south america
[15:28:12] <logmsgbot>	 !log jynus@cumin2001 dbctl commit (dc=all): 'depool db2075 for mw maintenance T254139', diff saved to https://phabricator.wikimedia.org/P11411 and previous config saved to /var/cache/conftool/dbconfig/20200608-152811-jynus.json
[15:28:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:16] <stashbot>	 T254139: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139
[15:29:16] <logmsgbot>	 !log hnowlan@deploy1001 Finished deploy [cpjobqueue/deploy@07d8c32]: Disabling jobs migrated to k8s (duration: 04m 34s)
[15:29:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:40] <wikibugs>	 (CR) CDanis: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/23066/" [puppet] - https://gerrit.wikimedia.org/r/603511 (https://phabricator.wikimedia.org/T252605) (owner: CDanis)
[15:29:41] <hnowlan>	 !log Migrated all cpjobqueue jobs from scb to Kubernetes 
[15:29:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:02] <icinga-wm>	 RECOVERY - PHP opcache health on mw2140 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:30:25] <Amir1>	 _joe_: lol, Canada :P
[15:34:22] <icinga-wm>	 PROBLEM - PHP opcache health on wtp2008 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:37:35] <wikibugs>	 Operations, ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (Papaul) @hnowlan Hello any reason why this is still open? Thanks
[15:40:01] <wikibugs>	 (PS1) Elukey: role::swap: remove access to analytics users [puppet] - https://gerrit.wikimedia.org/r/603522 (https://phabricator.wikimedia.org/T249752)
[15:40:43] <icinga-wm>	 PROBLEM - PHP opcache health on mw2203 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:41:00] <icinga-wm>	 PROBLEM - PHP opcache health on mw2209 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:41:02] <wikibugs>	 (CR) Elukey: [C: +2] role::swap: remove access to analytics users [puppet] - https://gerrit.wikimedia.org/r/603522 (https://phabricator.wikimedia.org/T249752) (owner: Elukey)
[15:41:02] <icinga-wm>	 PROBLEM - PHP opcache health on mw2204 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:41:04] <wikibugs>	 Operations, ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (hnowlan) Open→Resolved
[15:41:06] <icinga-wm>	 RECOVERY - PHP opcache health on mw2138 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:41:12] <icinga-wm>	 RECOVERY - PHP opcache health on mw2145 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:41:13] <icinga-wm>	 RECOVERY - PHP opcache health on mw2137 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:41:18] <icinga-wm>	 PROBLEM - PHP opcache health on mw2206 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:41:58] <icinga-wm>	 PROBLEM - PHP opcache health on mw2208 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:42:12] <icinga-wm>	 PROBLEM - PHP opcache health on mw2201 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:42:22] <icinga-wm>	 PROBLEM - PHP opcache health on mw2202 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:42:26] <icinga-wm>	 PROBLEM - PHP opcache health on mw2200 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:42:40] <icinga-wm>	 RECOVERY - PHP opcache health on mw2135 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:44:06] <icinga-wm>	 RECOVERY - PHP opcache health on mw2144 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:44:12] <icinga-wm>	 PROBLEM - PHP opcache health on mw2211 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:44:26] <icinga-wm>	 PROBLEM - PHP opcache health on mw2207 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:44:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw2147 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:44:28] <icinga-wm>	 PROBLEM - PHP opcache health on mw2218 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:44:40] <icinga-wm>	 PROBLEM - PHP opcache health on mw2216 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:44:42] <icinga-wm>	 RECOVERY - PHP opcache health on mw2146 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:44:44] <icinga-wm>	 PROBLEM - PHP opcache health on mw2210 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:44:44] <icinga-wm>	 PROBLEM - PHP opcache health on mw2214 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:44:52] <icinga-wm>	 PROBLEM - PHP opcache health on mw2215 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:44:54] <icinga-wm>	 PROBLEM - PHP opcache health on mw2212 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:45:03] <icinga-wm>	 PROBLEM - PHP opcache health on mw2217 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:45:10] <icinga-wm>	 PROBLEM - PHP opcache health on mw2305 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:45:32] <icinga-wm>	 PROBLEM - PHP opcache health on mw2219 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:45:52] <icinga-wm>	 PROBLEM - PHP opcache health on mw2307 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:46:06] <icinga-wm>	 PROBLEM - PHP opcache health on mw2304 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:46:08] <icinga-wm>	 PROBLEM - PHP opcache health on mw2309 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:46:11] <wikibugs>	 (PS1) Ammarpad: Remove Mobile mainpage special casing from it and vec wikisources [mediawiki-config] - https://gerrit.wikimedia.org/r/603524 (https://phabricator.wikimedia.org/T254731)
[15:46:34] <icinga-wm>	 PROBLEM - PHP opcache health on mw2301 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:46:34] <icinga-wm>	 RECOVERY - PHP opcache health on mw2142 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:46:45] <wikibugs>	 (PS1) Elukey: profile::swap: skip deployment of mysql-credentials [puppet] - https://gerrit.wikimedia.org/r/603525 (https://phabricator.wikimedia.org/T249752)
[15:46:58] <icinga-wm>	 RECOVERY - PHP opcache health on wtp2008 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:47:00] <icinga-wm>	 PROBLEM - PHP opcache health on mw2306 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:47:42] <icinga-wm>	 RECOVERY - PHP opcache health on mw2307 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:47:45] <wikibugs>	 (CR) Elukey: [C: +2] profile::swap: skip deployment of mysql-credentials [puppet] - https://gerrit.wikimedia.org/r/603525 (https://phabricator.wikimedia.org/T249752) (owner: Elukey)
[15:48:26] <icinga-wm>	 PROBLEM - PHP opcache health on mw2303 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:48:33] <wikibugs>	 (CR) DannyS712: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/603524 (https://phabricator.wikimedia.org/T254731) (owner: Ammarpad)
[15:49:50] <icinga-wm>	 RECOVERY - PHP opcache health on mw2207 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:49:52] <icinga-wm>	 RECOVERY - PHP opcache health on mw2139 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:50:02] <icinga-wm>	 RECOVERY - PHP opcache health on mw2209 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:52:02] <icinga-wm>	 RECOVERY - PHP opcache health on mw2303 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:53:40] <icinga-wm>	 PROBLEM - PHP opcache health on mw2283 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:53:54] <icinga-wm>	 PROBLEM - PHP opcache health on mw2288 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:53:56] <icinga-wm>	 PROBLEM - PHP opcache health on mw2289 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:54:06] <icinga-wm>	 PROBLEM - PHP opcache health on mw2285 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:54:12] <icinga-wm>	 PROBLEM - PHP opcache health on mw2286 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:54:18] <icinga-wm>	 PROBLEM - PHP opcache health on mw2284 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:54:56] <icinga-wm>	 PROBLEM - PHP opcache health on mw2287 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:57:32] <icinga-wm>	 PROBLEM - PHP opcache health on mw2332 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:57:56] <icinga-wm>	 RECOVERY - PHP opcache health on mw2284 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:57:56] <icinga-wm>	 PROBLEM - PHP opcache health on mw2333 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:58:02] <cdanis>	 👀
[15:58:14] <icinga-wm>	 PROBLEM - PHP opcache health on mw2298 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:58:14] <icinga-wm>	 PROBLEM - PHP opcache health on mw2331 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:58:32] <icinga-wm>	 PROBLEM - PHP opcache health on mw2299 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:58:34] <icinga-wm>	 PROBLEM - PHP opcache health on mw2257 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:58:48] <icinga-wm>	 PROBLEM - PHP opcache health on mw2254 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:59:08] <icinga-wm>	 PROBLEM - PHP opcache health on mw2290 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:59:08] <icinga-wm>	 PROBLEM - PHP opcache health on mw2256 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:59:20] <icinga-wm>	 PROBLEM - PHP opcache health on mw2293 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:59:20] <icinga-wm>	 PROBLEM - PHP opcache health on mw2291 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:59:22] <icinga-wm>	 PROBLEM - PHP opcache health on mw2255 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:00:04] <icinga-wm>	 PROBLEM - Check systemd state on mw2244 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:00:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2201 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:00:18] <icinga-wm>	 PROBLEM - PHP opcache health on mw2244 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:00:22] <icinga-wm>	 PROBLEM - PHP opcache health on mw2227 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:00:56] <icinga-wm>	 RECOVERY - PHP opcache health on mw2216 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:00:56] <icinga-wm>	 PROBLEM - PHP opcache health on mw2228 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:01:04] <icinga-wm>	 PROBLEM - PHP opcache health on mw2225 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:01:34] <icinga-wm>	 RECOVERY - PHP opcache health on mw2333 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:01:48] <icinga-wm>	 RECOVERY - PHP opcache health on mw2208 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:01:54] <icinga-wm>	 PROBLEM - PHP opcache health on mw2229 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:01:58] <icinga-wm>	 PROBLEM - PHP opcache health on mw2297 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:01:58] <icinga-wm>	 PROBLEM - PHP opcache health on mw2226 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:02:03] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Great work. If you want to test this on more baremetal servers, feel free to use the sretest* systems for this as well." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/601761 (https://phabricator.wikimedia.org/T252027) (owner: Kormat)
[16:02:36] <wikibugs>	 (PS1) Ppchelko: Beta: Switch from HTCP purging to kafka purging [mediawiki-config] - https://gerrit.wikimedia.org/r/603530 (https://phabricator.wikimedia.org/T250781)
[16:02:44] <icinga-wm>	 RECOVERY - PHP opcache health on mw2204 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:02:58] <icinga-wm>	 RECOVERY - PHP opcache health on mw2206 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:03:27] <wikibugs>	 (CR) RLazarus: [C: +1] "No strong feelings, LGTM if you prefer it this way. Thanks for reminding me that I need to get around to debianizing this." [puppet] - https://gerrit.wikimedia.org/r/603490 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[16:03:38] <icinga-wm>	 RECOVERY - PHP opcache health on mw2219 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:04:06] <icinga-wm>	 RECOVERY - PHP opcache health on mw2200 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:04:13] <icinga-wm>	 RECOVERY - PHP opcache health on mw2309 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:04:13] <icinga-wm>	 RECOVERY - PHP opcache health on mw2203 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:04:38] <icinga-wm>	 RECOVERY - PHP opcache health on mw2301 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:04:38] <icinga-wm>	 RECOVERY - PHP opcache health on mw2214 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:04:46] <icinga-wm>	 RECOVERY - PHP opcache health on mw2332 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:04:58] <icinga-wm>	 RECOVERY - PHP opcache health on mw2217 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:05:01] <wikibugs>	 (PS5) Urbanecm: Enable subpages in Page namespace on napwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: Zoranzoki21)
[16:05:04] <icinga-wm>	 RECOVERY - PHP opcache health on mw2305 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:05:05] <wikibugs>	 (CR) Dzahn: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/603490 (https://phabricator.wikimedia.org/T254480) (owner: Dzahn)
[16:05:29] <wikibugs>	 Operations, Performance-Team, serviceops, Sustainability (Incident Prevention): Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Krinkle) (//Moving to team inbox for next meeting.//)
[16:05:33] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: Zoranzoki21)
[16:05:40] <icinga-wm>	 PROBLEM - PHP opcache health on mw2367 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:05:54] <icinga-wm>	 RECOVERY - PHP opcache health on mw2211 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:05:56] <icinga-wm>	 PROBLEM - PHP opcache health on mw2361 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:06:10] <icinga-wm>	 PROBLEM - PHP opcache health on mw2364 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:06:22] <icinga-wm>	 PROBLEM - PHP opcache health on mw2369 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:06:34] <icinga-wm>	 RECOVERY - PHP opcache health on mw2215 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:06:36] <icinga-wm>	 RECOVERY - PHP opcache health on mw2255 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:07:06] <icinga-wm>	 PROBLEM - PHP opcache health on mw2368 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:07:16] <icinga-wm>	 PROBLEM - PHP opcache health on mw2365 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:07:20] <icinga-wm>	 PROBLEM - PHP opcache health on mw2363 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:07:20] <icinga-wm>	 RECOVERY - PHP opcache health on mw2229 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:07:24] <icinga-wm>	 PROBLEM - PHP opcache health on mw2366 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:07:36] <icinga-wm>	 RECOVERY - PHP opcache health on mw2299 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:07:40] <icinga-wm>	 PROBLEM - PHP opcache health on mw2198 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:08:02] <wikibugs>	 (CR) Eevans: [C: +1] Drop maps from supported clusters [cookbooks] - https://gerrit.wikimedia.org/r/602318 (owner: Muehlenhoff)
[16:08:26] <icinga-wm>	 RECOVERY - PHP opcache health on mw2212 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:08:48] <icinga-wm>	 PROBLEM - PHP opcache health on mw2253 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:09:04] <icinga-wm>	 PROBLEM - PHP opcache health on mw2252 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:09:32] <icinga-wm>	 PROBLEM - PHP opcache health on mw2251 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:10:06] <icinga-wm>	 RECOVERY - PHP opcache health on mw2225 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:10:20] <wikibugs>	 Operations, Analytics, Traffic: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (Milimetric) p:Medium→High
[16:10:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw2286 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:10:54] <icinga-wm>	 RECOVERY - PHP opcache health on mw2331 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:11:12] <icinga-wm>	 RECOVERY - PHP opcache health on mw2227 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:11:12] <icinga-wm>	 RECOVERY - PHP opcache health on mw2257 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:11:42] <icinga-wm>	 PROBLEM - PHP opcache health on mw2221 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:11:46] <icinga-wm>	 RECOVERY - PHP opcache health on mw2283 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:11:48] <icinga-wm>	 PROBLEM - PHP opcache health on mw2220 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:11:56] <icinga-wm>	 PROBLEM - PHP opcache health on mw2359 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:12:24] <icinga-wm>	 PROBLEM - PHP opcache health on mw2355 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:12:32] <icinga-wm>	 PROBLEM - PHP opcache health on mw2350 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:12:40] <icinga-wm>	 PROBLEM - PHP opcache health on mw2358 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:12:40] <icinga-wm>	 RECOVERY - PHP opcache health on mw2252 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:13:02] <icinga-wm>	 PROBLEM - PHP opcache health on mw2222 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:13:02] <icinga-wm>	 PROBLEM - PHP opcache health on mw2357 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:13:06] <icinga-wm>	 RECOVERY - PHP opcache health on mw2202 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:13:16] <icinga-wm>	 PROBLEM - PHP opcache health on mw2351 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:13:16] <icinga-wm>	 PROBLEM - PHP opcache health on mw2353 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:13:24] <icinga-wm>	 PROBLEM - PHP opcache health on mw2223 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:13:36] <icinga-wm>	 RECOVERY - PHP opcache health on mw2228 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:13:48] <icinga-wm>	 RECOVERY - PHP opcache health on mw2288 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:14:00] <icinga-wm>	 RECOVERY - PHP opcache health on mw2285 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:14:14] <wikibugs>	 Operations, ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (Dzahn) What's up with the certificate renewal issue? (15 days left). Does it need a separate task?
[16:14:47] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, ran tests with https://wikitech.wikimedia.org/wiki/Thumbor#Local_development" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/603386 (https://phabricator.wikimedia.org/T254557) (owner: Gilles)
[16:15:06] <wikibugs>	 (CR) EBernhardson: [C: +1] "should be deployable in any SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/595019 (owner: Mstyles)
[16:15:20] <icinga-wm>	 RECOVERY - PHP opcache health on mw2221 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:15:30] <icinga-wm>	 RECOVERY - PHP opcache health on mw2210 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:15:32] <icinga-wm>	 PROBLEM - PHP opcache health on mw2371 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:15:52] <icinga-wm>	 RECOVERY - PHP opcache health on mw2306 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:16:00] <icinga-wm>	 RECOVERY - PHP opcache health on mw2355 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:16:26] <icinga-wm>	 PROBLEM - PHP opcache health on mw2376 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:16:26] <icinga-wm>	 PROBLEM - PHP opcache health on mw2374 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:16:26] <icinga-wm>	 RECOVERY - PHP opcache health on mw2226 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:16:36] <icinga-wm>	 PROBLEM - PHP opcache health on mw2372 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:17:20] <icinga-wm>	 PROBLEM - PHP opcache health on mw2375 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:17:26] <icinga-wm>	 RECOVERY - PHP opcache health on mw2293 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:18:08] <icinga-wm>	 RECOVERY - PHP opcache health on mw2298 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:18:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2297 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:19:03] <icinga-wm>	 RECOVERY - PHP opcache health on mw2290 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:19:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2291 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:20:38] <icinga-wm>	 RECOVERY - PHP opcache health on mw2218 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:21:04] <icinga-wm>	 RECOVERY - PHP opcache health on mw2289 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:22:24] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[16:23:38] <icinga-wm>	 RECOVERY - PHP opcache health on mw2366 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:23:48] <icinga-wm>	 RECOVERY - PHP opcache health on mw2244 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:24:06] <icinga-wm>	 RECOVERY - PHP opcache health on mw2304 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:24:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2364 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:25:10] <icinga-wm>	 RECOVERY - PHP opcache health on mw2368 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:25:18] <icinga-wm>	 RECOVERY - PHP opcache health on mw2365 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:25:23] <icinga-wm>	 RECOVERY - PHP opcache health on mw2363 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:25:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw2376 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:25:56] <icinga-wm>	 RECOVERY - PHP opcache health on mw2254 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:26:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2369 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:27:00] <wikibugs>	 (PS1) Hnowlan: changeprop: remove changeprop from puppet [puppet] - https://gerrit.wikimedia.org/r/603534 (https://phabricator.wikimedia.org/T220399)
[16:27:20] <icinga-wm>	 RECOVERY - PHP opcache health on mw2367 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:27:38] <icinga-wm>	 RECOVERY - PHP opcache health on mw2361 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:29:24] <icinga-wm>	 RECOVERY - PHP opcache health on mw2251 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:29:40] <icinga-wm>	 RECOVERY - PHP opcache health on mw2223 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:29:53] <icinga-wm>	 RECOVERY - PHP opcache health on mw2220 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:31:06] <icinga-wm>	 RECOVERY - PHP opcache health on mw2357 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:31:18] <icinga-wm>	 RECOVERY - PHP opcache health on mw2351 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:31:20] <icinga-wm>	 RECOVERY - PHP opcache health on mw2136 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:31:42] <icinga-wm>	 RECOVERY - PHP opcache health on mw2256 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:31:50] <icinga-wm>	 RECOVERY - PHP opcache health on mw2359 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:32:26] <icinga-wm>	 RECOVERY - PHP opcache health on mw2350 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:32:52] <icinga-wm>	 RECOVERY - PHP opcache health on mw2222 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:33:08] <icinga-wm>	 RECOVERY - PHP opcache health on mw2353 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:33:36] <icinga-wm>	 RECOVERY - PHP opcache health on mw2371 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:33:50] <icinga-wm>	 RECOVERY - PHP opcache health on mw2143 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:34:04] <icinga-wm>	 RECOVERY - PHP opcache health on mw2253 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:34:40] <icinga-wm>	 RECOVERY - PHP opcache health on mw2372 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:38:17] <wikibugs>	 Operations, ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (hnowlan) Yeah I think so - there are multiple hosts affected by this issue. Tracking in T254784
[16:39:32] <icinga-wm>	 PROBLEM - PHP opcache health on mw2319 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:39:47] <wikibugs>	 Operations, Commons, Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (Jidanni) I ran the mtr tests earlier in this bug report.  Running a curl gives  ` {"errors":[{"code":"empty-file","html":"The file you submit...
[16:40:46] <icinga-wm>	 RECOVERY - PHP opcache health on mw2375 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:42:08] <wikibugs>	 (CR) MSantos: maps: profile::rsyslog::udp_localhost_compat (1 comment) [puppet] - https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: Mholloway)
[16:43:00] <wikibugs>	 (PS3) MSantos: maps: profile::rsyslog::udp_localhost_compat [puppet] - https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: Mholloway)
[16:43:22] <icinga-wm>	 RECOVERY - PHP opcache health on mw2358 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:44:15] <wikibugs>	 Operations, Analytics, Analytics-Kanban, EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (stjn) This is a very strange conclusion to this task. There was never an assumption that you do not need...
[16:44:56] <liw>	 !log testing upcoming Scap release on beta
[16:44:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:08] <wikibugs>	 (CR) Mholloway: [C: +1] maps: profile::rsyslog::udp_localhost_compat [puppet] - https://gerrit.wikimedia.org/r/602704 (https://phabricator.wikimedia.org/T222377) (owner: Mholloway)
[16:45:34] <icinga-wm>	 PROBLEM - PHP opcache health on wtp2018 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:47:06] <icinga-wm>	 RECOVERY - PHP opcache health on mw2374 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:47:18] <icinga-wm>	 RECOVERY - PHP opcache health on mw2287 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:47:22] <icinga-wm>	 RECOVERY - PHP opcache health on mw2198 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:54:42] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0)
[16:54:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:14] <elukey>	 \o/
[16:55:50] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker
[16:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:56] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:57:50] <wikibugs>	 Operations, ops-codfw, DBA: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (Papaul) Before   `  BIOS Version   2.4.3 Firmware Version   2.40.40.40 IP Address(es)   10.193.1.55 iDRAC MAC Address   84:7B:EB:F6:97:56 DNS Domai...
[16:58:18] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:58:43] <wikibugs>	 (CR) Ladsgroup: "I will deploy this tomorrow if there's no objection." [mediawiki-config] - https://gerrit.wikimedia.org/r/602675 (https://phabricator.wikimedia.org/T111853) (owner: Ladsgroup)
[16:58:44] <wikibugs>	 Operations, ops-codfw, DBA: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (Papaul) Open→Resolved @jcrespo firmware upgrade complete
[16:58:47] <wikibugs>	 Operations, DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Papaul)
[16:59:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2319 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[17:00:04] <jouncebot>	 gehel and onimisionipe: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T1700)
[17:00:06] <wikibugs>	 Operations, Analytics, Analytics-Kanban, EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (BBlack) >>! In T242767#6201754, @Ottomata wrote: [reordering a little] > What happens right now if someon...
[17:01:09] <Amir1>	 Krinkle: let me know when you want to backport the change
[17:01:50] <Krinkle>	 Amir1: link?
[17:02:02] <Amir1>	 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/603536
[17:02:29] <wikibugs>	 (PS1) Krinkle: mediawiki.misc-authed-curate: Check for 'showrollbackconfirmation' preference [core] (wmf/1.35.0-wmf.35) - https://gerrit.wikimedia.org/r/603544 (https://phabricator.wikimedia.org/T254538)
[17:02:32] <Krinkle>	 Amir1: sure :)
[17:02:50] <Krinkle>	 oh neat, the wikibugs change was rolled out
[17:03:07] <wikibugs>	 (CR) Ladsgroup: [C: +2] "UBN" [core] (wmf/1.35.0-wmf.35) - https://gerrit.wikimedia.org/r/603544 (https://phabricator.wikimedia.org/T254538) (owner: Krinkle)
[17:04:10] <icinga-wm>	 PROBLEM - PHP opcache health on mw2270 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[17:04:40] <Majavah>	 what wikibugs change?
[17:06:15] <wikibugs>	 (PS1) Dave Pifke: Use PDO for XHGui storage if configured [mediawiki-config] - https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761)
[17:06:32] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37
[17:06:44] <icinga-wm>	 RECOVERY - Check systemd state on mw2244 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:07:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] Use PDO for XHGui storage if configured [mediawiki-config] - https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[17:11:45] <wikibugs>	 (PS1) Catrope: GrowthExperiments: End A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/603548 (https://phabricator.wikimedia.org/T254413)
[17:13:10] <icinga-wm>	 RECOVERY - PHP opcache health on mw2270 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[17:13:22] <wikibugs>	 (CR) Cwhite: [C: +2] profile: add loki_event filter script [puppet] - https://gerrit.wikimedia.org/r/602729 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[17:14:03] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0)
[17:14:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:10] <wikibugs>	 (PS1) Dave Pifke: [WIP] webperf: Remove XHGui dependency on MongoDB [puppet] - https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761)
[17:18:43] <wikibugs>	 (CR) jerkins-bot: [V: -1] [WIP] webperf: Remove XHGui dependency on MongoDB [puppet] - https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[17:19:17] <wikibugs>	 (PS10) Cwhite: profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826)
[17:19:54] <wikibugs>	 (CR) Cwhite: profile: add loki output support to the logstash pipeline (3 comments) [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[17:21:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[17:21:12] <icinga-wm>	 PROBLEM - PHP opcache health on mw2275 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[17:21:32] <icinga-wm>	 RECOVERY - PHP opcache health on wtp2018 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[17:23:29] <wikibugs>	 Operations, Analytics, Analytics-Kanban, EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (Ottomata) (Thanks for the response bblack!)  > 2. Does the typical client handle the disconnect gracefull...
[17:24:34] <wikibugs>	 (CR) RLazarus: [C: +1] expand phpfpm status text exporter to all appservers [puppet] - https://gerrit.wikimedia.org/r/603511 (https://phabricator.wikimedia.org/T252605) (owner: CDanis)
[17:25:08] <wikibugs>	 (CR) CDanis: [C: +2] expand phpfpm status text exporter to all appservers [puppet] - https://gerrit.wikimedia.org/r/603511 (https://phabricator.wikimedia.org/T252605) (owner: CDanis)
[17:28:17] <wikibugs>	 (PS1) Urbanecm: Do not require opt-in for guidance in beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/603552 (https://phabricator.wikimedia.org/T254789)
[17:30:19] <wikibugs>	 (PS2) Dave Pifke: [WIP] webperf: Remove XHGui dependency on MongoDB [puppet] - https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761)
[17:31:19] <wikibugs>	 (Merged) jenkins-bot: mediawiki.misc-authed-curate: Check for 'showrollbackconfirmation' preference [core] (wmf/1.35.0-wmf.35) - https://gerrit.wikimedia.org/r/603544 (https://phabricator.wikimedia.org/T254538) (owner: Krinkle)
[17:31:36] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (NEED BY: ASAP) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (Cmjohnson)
[17:35:38] <icinga-wm>	 RECOVERY - PHP opcache health on mw2275 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[17:37:45] <wikibugs>	 (PS2) Mstyles: Update ML models [mediawiki-config] - https://gerrit.wikimedia.org/r/595019 (https://phabricator.wikimedia.org/T219534)
[17:42:55] <wikibugs>	 Operations, ops-codfw, DBA: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (jcrespo) Thank you for the help, putting the services back up.
[17:43:07] <wikibugs>	 (PS1) Cmjohnson: Adding thanos-fe100[123] to dhcpd file and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/603554 (https://phabricator.wikimedia.org/T251620)
[17:43:16] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.35/resources/src/mediawiki.misc-authed-curate/rollback.js: Fix: Diff pages show rollback confirmation prompt if there is the "Mark as patrolled" link (T254538) (duration: 00m 59s)
[17:43:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:20] <stashbot>	 T254538: Diff pages show rollback confirmation prompt if there is the "Mark as patrolled" link - https://phabricator.wikimedia.org/T254538
[17:49:27] <wikibugs>	 Operations, ops-codfw, DBA: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (jcrespo) @marostegui there seems to be a bug on 10.1.45-MariaDB installed locally, as the systemd unit doesn't notify the start (despite actually g...
[17:50:08] <elukey>	 !log restart prometheus burrow exporter for kafka main on kafkamon1001 - T254498
[17:50:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:12] <stashbot>	 T254498: reset of burrow metrics for consumer group - https://phabricator.wikimedia.org/T254498
[17:51:30] <elukey>	 Pchelolo: --^
[17:51:41] <wikibugs>	 (PS7) Privacybatm: Write documentation using Sphinx [software/transferpy] - https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219)
[17:52:20] <Pchelolo>	 elukey: thank you! doesn't reflect on the graphs yet
[17:52:40] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding thanos-fe100[123] to dhcpd file and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/603554 (https://phabricator.wikimedia.org/T251620) (owner: Cmjohnson)
[17:52:49] <elukey>	 weird I still see cpjobqueue-low_traffic_jobs listed with details
[17:52:57] <elukey>	 in burrow I mean
[17:53:38] <elukey>	 Pchelolo: sure that the cgroup is not active anymore?
[17:53:44] <elukey>	 cpjobqueue-low_traffic_jobs
[17:54:03] <Pchelolo>	 it should be still active, but for a different subset of topics
[17:54:18] <elukey>	 https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=cpjobqueue-low_traffic_jobs
[17:54:24] <elukey>	 this seems ok though --^
[17:54:28] <Pchelolo>	 the consumer group is valid, we just subscribed it to a wrong set of topics
[17:54:49] <elukey>	 the graph seems updated no?
[17:55:03] <elukey>	 or does it show old topics?
[17:55:13] <Pchelolo>	 yup, looks good to me
[17:55:16] <Pchelolo>	 thank you elukey!
[17:55:19] <elukey>	 super :)
[17:57:07] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Zayo trouble tickets TTN-0004144337, TTN-0004143746, and TTN-0004144096, for some reason. - The acknowledgement expires at: 2020-06-09 17:56:35. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:57:07] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Zayo trouble tickets TTN-0004144337, TTN-0004143746, and TTN-0004144096, for some reason. - The acknowledgement expires at: 2020-06-09 17:56:35. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T1800).
[18:00:04] <jouncebot>	 Pchelolo and RoanKattouw: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:18] <RoanKattouw>	 I can do the SWAT
[18:00:58] <Pchelolo>	 cool, thank you RoanKattouw. Would you do mine or yours first?
[18:01:02] <Lucas_WMDE>	 the what now? ;)
[18:01:12] <RoanKattouw>	 Oh ey
[18:01:16] <RoanKattouw>	 Good rename, I like it
[18:01:23] <Lucas_WMDE>	 (I’m hanging around partly out of curiosity for the new log messages and the like)
[18:02:35] <RoanKattouw>	 OK this was only announced an hour ago, so I don't have to feel bad about missing the announcement email :)
[18:02:39] <Lucas_WMDE>	 ^^
[18:02:52] <Lucas_WMDE>	 yeah maybe I should’ve been less cryptic sorry
[18:02:58] <wikibugs>	 (CR) Krinkle: [C: -1] "Only use wg* for overriding core config keys. For things local to wmf-config, use wmg*." [mediawiki-config] - https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[18:03:30] <RoanKattouw>	 Pchelolo: Mind if I amend your patch to ---- yes that --^^
[18:03:35] <Pchelolo>	 yup
[18:03:47] <Pchelolo>	 oh, I mean, 'I don't mind'
[18:05:24] <wikibugs>	 (PS3) Catrope: Disable HTCP purges for test.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[18:05:58] <wikibugs>	 (CR) Catrope: [C: +2] Disable HTCP purges for test.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[18:06:38] <Pchelolo>	 RoanKattouw: I would need some time for testing it on mwdebug as well please
[18:06:56] <wikibugs>	 (Merged) jenkins-bot: Disable HTCP purges for test.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/603514 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[18:07:51] <RoanKattouw>	 Pchelolo: It's there, test away
[18:07:56] <Pchelolo>	 thank you
[18:10:07] <Pchelolo>	 RoanKattouw: mwdebug1001 or 1002?
[18:10:12] <RoanKattouw>	 1002 sorry
[18:11:16] <Pchelolo>	 RoanKattouw: All good! thank you
[18:11:49] <wikibugs>	 Puppet, Cloud-VPS, cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (bd808) >>! In T254491#6200618, @hashar wrote: > The incident page <code>20200605-cloud-private-repo</code> has the date the page has been cr...
[18:15:28] <wikibugs>	 (PS2) Dave Pifke: Use PDO for XHGui storage if configured [mediawiki-config] - https://gerrit.wikimedia.org/r/603546 (https://phabricator.wikimedia.org/T180761)
[18:16:27] <wikibugs>	 (CR) Jforrester: "recheck" [puppet] - https://gerrit.wikimedia.org/r/598058 (https://phabricator.wikimedia.org/T253263) (owner: Hashar)
[18:19:36] <RoanKattouw>	 OK, deploying
[18:19:41] <wikibugs>	 Operations, ops-codfw, DBA: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (Marostegui) Yeah, I was testing the new version on that host with the new package and then I got into lots of others things. If you have some time...
[18:20:29] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable HTCP purges for testwiki (T250781) (part 1) (duration: 00m 59s)
[18:20:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:33] <stashbot>	 T250781: Move all purge traffic to kafka - https://phabricator.wikimedia.org/T250781
[18:23:13] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Disable HTCP purges for testwiki (T250781) (part 2) (duration: 00m 56s)
[18:23:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:04] <wikibugs>	 (PS2) Catrope: GrowthExperiments: End A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/603548 (https://phabricator.wikimedia.org/T254413)
[18:24:10] <wikibugs>	 (CR) Catrope: [C: +2] GrowthExperiments: End A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/603548 (https://phabricator.wikimedia.org/T254413) (owner: Catrope)
[18:25:08] <wikibugs>	 (PS2) Urbanecm: GrowthExperiments: Do not require opt-in for guidance in beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/603552 (https://phabricator.wikimedia.org/T254789)
[18:25:22] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: End A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/603548 (https://phabricator.wikimedia.org/T254413) (owner: Catrope)
[18:28:01] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (NEED BY: ASAP) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-fe1003.eqiad.wmnet ` The log can be found in...
[18:28:19] <logmsgbot>	 !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: End GrowthExperiments homepage A/B test (T254413) (duration: 00m 57s)
[18:28:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:22] <stashbot>	 T254413: Variant tests: switch all newcomers to Variant A - https://phabricator.wikimedia.org/T254413
[18:28:56] <Urbanecm>	 RoanKattouw: if you could review my GE patch too, it would be cool <https://gerrit.wikimedia.org/r/603552>
[18:29:12] <Urbanecm>	 (also, ping me when done, would like to do a couple of things)
[18:29:27] <wikibugs>	 (CR) Catrope: [C: +2] GrowthExperiments: Do not require opt-in for guidance in beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/603552 (https://phabricator.wikimedia.org/T254789) (owner: Urbanecm)
[18:29:39] <RoanKattouw>	 Thanks for finding and fixing that!
[18:29:45] <Urbanecm>	 no problem!
[18:30:09] <RoanKattouw>	 Beta updates its config automatically every 10 minutes, so you'll probably have to wait a little bit
[18:30:13] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: Do not require opt-in for guidance in beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/603552 (https://phabricator.wikimedia.org/T254789) (owner: Urbanecm)
[18:30:29] <Urbanecm>	 wasn't that in postmerge?
[18:31:10] <RoanKattouw>	 I think the beta deployment host getting the new config patch is in postmerge, but I don't think the full deployment (beta-scap-eqiad) is
[18:31:19] <Urbanecm>	 gotcha
[18:35:25] <Urbanecm>	 RoanKattouw: are you still deploying?
[18:35:33] <RoanKattouw>	 No, I'm done
[18:39:27] <Urbanecm>	 okay, thx
[18:39:53] <wikibugs>	 (PS6) Urbanecm: Enable subpages in Page namespace on napwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: Zoranzoki21)
[18:40:00] <wikibugs>	 (CR) Urbanecm: [C: +2] Enable subpages in Page namespace on napwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: Zoranzoki21)
[18:40:23] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (NEED BY: ASAP) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-fe1003.eqiad.wmnet'] `  Of which those **FAILED**: ` ['thanos-fe1003.eqiad.wmnet'] `
[18:42:01] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime
[18:42:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:35] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:44:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:17] <wikibugs>	 (CR) Urbanecm: [C: +2] Enable subpages in Page namespace on napwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: Zoranzoki21)
[18:48:08] <wikibugs>	 (Merged) jenkins-bot: Enable subpages in Page namespace on napwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/596477 (https://phabricator.wikimedia.org/T252755) (owner: Zoranzoki21)
[18:49:11] <wikibugs>	 (PS10) Urbanecm: IS.php: Add wgProofreadPagePageJoiner, set it per default on '-' and at zhwikisource on __PAGEJOIN__ [mediawiki-config] - https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) (owner: Zoranzoki21)
[18:51:10] <wikibugs>	 (PS11) Urbanecm: IS.php: Add wgProofreadPagePageJoiner, set it per default on '-' and at zhwikisource on __PAGEJOIN__ [mediawiki-config] - https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) (owner: Zoranzoki21)
[18:51:28] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0e85203: Enable subpages in Page namespace on napwikisource (T252755) (duration: 00m 58s)
[18:51:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:32] <stashbot>	 T252755: Add subpages in ns Page for nap.source - https://phabricator.wikimedia.org/T252755
[18:52:33] <wikibugs>	 (CR) Urbanecm: [C: +2] IS.php: Add wgProofreadPagePageJoiner, set it per default on '-' and at zhwikisource on __PAGEJOIN__ [mediawiki-config] - https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) (owner: Zoranzoki21)
[18:53:25] <wikibugs>	 (Merged) jenkins-bot: IS.php: Add wgProofreadPagePageJoiner, set it per default on '-' and at zhwikisource on __PAGEJOIN__ [mediawiki-config] - https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) (owner: Zoranzoki21)
[18:54:43] <icinga-wm>	 PROBLEM - PHP opcache health on mw2233 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[18:55:03] <wikibugs>	 (PS6) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/602459
[18:55:12] <logmsgbot>	 !log urbanecm@deploy1001 sync-file aborted: SWAT: 1630a10: Set wgProofreadPagePageJoiner to __PAGEJOIN__ for zhwikisource (duration: 00m 00s)
[18:55:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] wip [puppet] - https://gerrit.wikimedia.org/r/602459 (owner: Herron)
[18:56:16] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 1630a10: Set wgProofreadPagePageJoiner to __PAGEJOIN__ for zhwikisource (T205826) (duration: 00m 58s)
[18:56:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:19] <stashbot>	 T205826: Set wgProofreadPagePageJoiner on zh.wikisource - https://phabricator.wikimedia.org/T205826
[18:56:33] <Urbanecm>	 !log Morning <del>SWAT</del>config/backport window done
[18:56:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:00] <wikibugs>	 (PS7) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/602459
[19:04:20] <wikibugs>	 (PS1) CDanis: run-puppet-agent: add new flag --unless-version SUBSTR [puppet] - https://gerrit.wikimedia.org/r/603577
[19:04:45] <icinga-wm>	 RECOVERY - PHP opcache health on mw2233 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:06:05] <wikibugs>	 (PS8) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/602459
[19:06:40] <wikibugs>	 (PS2) CDanis: run-puppet-agent: add new flag --unless-version SUBSTR [puppet] - https://gerrit.wikimedia.org/r/603577
[19:09:13] <wikibugs>	 Operations, Wikidata, Wikidata-Query-Service: wdqs updater should be better isolated from blazegraph and common workload should be shared between servers - https://phabricator.wikimedia.org/T207837 (Gehel) Open→Declined This is being addressed as part of T244590
[19:09:17] <wikibugs>	 Operations, Analytics, Event-Platform, Wikidata, and 7 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (Gehel)
[19:10:22] <wikibugs>	 (PS9) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/602459
[19:15:13] <icinga-wm>	 PROBLEM - PHP opcache health on mw2235 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:15:17] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:15:52] <wikibugs>	 (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler1001/23075/" [puppet] - https://gerrit.wikimedia.org/r/602459 (owner: Herron)
[19:17:03] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:17:45] <wikibugs>	 (PS10) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/602459
[19:23:25] <wikibugs>	 (CR) Mforns: [C: +1] "This looks great! Thanks for the tip" [puppet] - https://gerrit.wikimedia.org/r/602771 (https://phabricator.wikimedia.org/T254480) (owner: Jbond)
[19:24:47] <wikibugs>	 (PS11) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/602459
[19:30:18] <wikibugs>	 (CR) Volans: [C: +1] "Seems reasonable, at the same time we should convert all those to Python :D" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/603577 (owner: CDanis)
[19:32:41] <icinga-wm>	 PROBLEM - PHP opcache health on mw2231 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:35:57] <wikibugs>	 (PS2) Aaron Schulz: Enable "coalesceKeys" for global keys for WANCache [mediawiki-config] - https://gerrit.wikimedia.org/r/598855
[19:40:51] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:42:41] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:43:33] <icinga-wm>	 RECOVERY - PHP opcache health on mw2231 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:47:06] <wikibugs>	 (PS12) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/602459
[19:51:53] <icinga-wm>	 PROBLEM - PHP opcache health on mw2277 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:53:11] <icinga-wm>	 RECOVERY - PHP opcache health on mw2235 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:56:18] <wikibugs>	 (CR) Herron: [C: -2] "getting there...  PCC looks okay-ish https://puppet-compiler.wmflabs.org/compiler1001/23079/ but this is not yet safe to merge" [puppet] - https://gerrit.wikimedia.org/r/602459 (owner: Herron)
[20:00:04] <jouncebot>	 halfak and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T2000).
[20:01:06] <wikibugs>	 (PS13) Herron: elasticsearch: manage java dependencies with ::profile::java [puppet] - https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913)
[20:01:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch: manage java dependencies with ::profile::java [puppet] - https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: Herron)
[20:02:56] <wikibugs>	 (CR) Herron: [C: -2] "I'll leave this a -2 and as WIP, but requesting initial feedback as this is fairly wide reaching with risk of breakage." [puppet] - https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913) (owner: Herron)
[20:04:39] <wikibugs>	 (PS14) Herron: elasticsearch: manage java dependencies with ::profile::java [puppet] - https://gerrit.wikimedia.org/r/602459 (https://phabricator.wikimedia.org/T252913)
[20:06:25] <icinga-wm>	 RECOVERY - PHP opcache health on mw2277 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[20:10:17] <wikibugs>	 Operations, Analytics, Analytics-Kanban: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (Nuria) Open→Resolved
[20:16:15] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (lmata)
[20:22:07] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[20:23:03] <wikibugs>	 Operations, ops-eqiad, Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (wiki_willy)
[20:24:01] <wikibugs>	 Operations, ops-eqiad, DC-Ops, netops, cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (wiki_willy)
[20:24:04] <wikibugs>	 Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (wiki_willy)
[20:24:46] <wikibugs>	 Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (wiki_willy)
[20:24:51] <icinga-wm>	 PROBLEM - PHP opcache health on wtp2002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[20:25:51] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (wiki_willy)
[20:26:20] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (wiki_willy)
[20:27:03] <RoanKattouw>	 !log Running initUserPreference.php -s growthexperiments-homepage-enable -t growthexperiments-help-panel-tog-help-panel on wikis that have GrowthExperiments installed (T240920)
[20:27:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:08] <stashbot>	 T240920: Variant tests: turn on help panel for homepage people - https://phabricator.wikimedia.org/T240920
[20:28:09] <icinga-wm>	 PROBLEM - PHP opcache health on mw2274 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[20:32:03] <wikibugs>	 (CR) Mholloway: [C: +2] Chromium-render: Add initial helmfile stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) (owner: Mholloway)
[20:32:34] <wikibugs>	 (Merged) jenkins-bot: Chromium-render: Add initial helmfile stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) (owner: Mholloway)
[20:34:33] <wikibugs>	 Operations, Wikimedia-SVG-rendering: Install (currently non-existing) Debian packages for PT (paratype) font on image scalars - https://phabricator.wikimedia.org/T97181 (Aklapper) Stalled→Open Reopening per last comment.
[20:37:35] <icinga-wm>	 RECOVERY - PHP opcache health on wtp2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[20:39:03] <icinga-wm>	 RECOVERY - PHP opcache health on mw2274 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[20:52:02] <Amir1>	 !log applying the sql alter table on [[gerrit:594292|ipblocks]] on labswiki (T251188)
[20:52:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:10] <stashbot>	 T251188: ipb_address_unique has an extra column in production but not in the code - https://phabricator.wikimedia.org/T251188
[20:53:13] <wikibugs>	 (PS2) Cwhite: hiera: install mtail 3.0.0~rc35 from component in ulsfo and codfw [puppet] - https://gerrit.wikimedia.org/r/601874 (https://phabricator.wikimedia.org/T251466)
[21:00:04] <jouncebot>	 Reedy and sbassett: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T2100).
[21:02:18] <wikibugs>	 (CR) Cwhite: [C: +2] "PCC checks out https://puppet-compiler.wmflabs.org/compiler1003/23080/" [puppet] - https://gerrit.wikimedia.org/r/601874 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[21:03:17] <wikibugs>	 (CR) Mholloway: [C: +2] "Hi Alex, even after waiting on the next Puppet run after this was merged, it doesn't appear that Puppet has created the .hfenv files and p" [deployment-charts] - https://gerrit.wikimedia.org/r/602164 (https://phabricator.wikimedia.org/T225680) (owner: Mholloway)
[21:06:55] <icinga-wm>	 PROBLEM - PHP opcache health on mw2242 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:17:40] <wikibugs>	 (PS1) Cwhite: hiera: set mtail disable_fsnotify in codfw and ulsfo [puppet] - https://gerrit.wikimedia.org/r/603614 (https://phabricator.wikimedia.org/T251466)
[21:21:27] <icinga-wm>	 RECOVERY - PHP opcache health on mw2242 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:22:54] <wikibugs>	 (CR) Cwhite: [C: +2] hiera: set mtail disable_fsnotify in codfw and ulsfo [puppet] - https://gerrit.wikimedia.org/r/603614 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[21:26:02] <wikibugs>	 Operations, Analytics, Analytics-Kanban, EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (stjn) >>! In T242767#6202740, @Ottomata wrote: > I guess I'd like to hear from the EventStreams users on...
[21:40:58] <wikibugs>	 Operations, ops-codfw, netops: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (Papaul)
[21:43:49] <icinga-wm>	 PROBLEM - PHP opcache health on mw2269 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:53:51] <icinga-wm>	 PROBLEM - PHP opcache health on mw2326 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:54:41] <icinga-wm>	 RECOVERY - PHP opcache health on mw2269 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[21:55:52] <wikibugs>	 (PS1) Cwhite: hiera: add disable_fsnotify mtail flag to ncredir [puppet] - https://gerrit.wikimedia.org/r/603626 (https://phabricator.wikimedia.org/T251466)
[21:57:22] <wikibugs>	 (CR) Cwhite: [C: +2] hiera: add disable_fsnotify mtail flag to ncredir [puppet] - https://gerrit.wikimedia.org/r/603626 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[21:58:45] <wikibugs>	 (CR) AntiCompositeNumber: [C: +1] "Looks good from here as well." [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/603386 (https://phabricator.wikimedia.org/T254557) (owner: Gilles)
[22:00:57] <wikibugs>	 (PS1) Cwhite: hiera: move ncredir config to profile [puppet] - https://gerrit.wikimedia.org/r/603628 (https://phabricator.wikimedia.org/T251466)
[22:01:46] <wikibugs>	 (CR) Cwhite: [C: +2] hiera: move ncredir config to profile [puppet] - https://gerrit.wikimedia.org/r/603628 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[22:03:36] <wikibugs>	 (PS1) Cmjohnson: Adding relforge100[34] to site.pp role insetup [puppet] - https://gerrit.wikimedia.org/r/603634 (https://phabricator.wikimedia.org/T241791)
[22:04:06] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding relforge100[34] to site.pp role insetup [puppet] - https://gerrit.wikimedia.org/r/603634 (https://phabricator.wikimedia.org/T241791) (owner: Cmjohnson)
[22:12:01] <icinga-wm>	 RECOVERY - PHP opcache health on mw2326 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:13:33] <wikibugs>	 (PS1) Cmjohnson: Adding thanos-fe100[1-3] to site.pp insetup role [puppet] - https://gerrit.wikimedia.org/r/603637 (https://phabricator.wikimedia.org/T251620)
[22:20:08] <wikibugs>	 Operations, ops-codfw, decommission, serviceops, Patch-For-Review: codfw: decom at least 15 appservers  in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (Papaul) switch ports removed for mw2154 through mw2186
[22:21:35] <wikibugs>	 (PS2) Cmjohnson: Adding thanos-fe100[1-3] to site.pp insetup role [puppet] - https://gerrit.wikimedia.org/r/603637 (https://phabricator.wikimedia.org/T251620)
[22:22:53] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding thanos-fe100[1-3] to site.pp insetup role [puppet] - https://gerrit.wikimedia.org/r/603637 (https://phabricator.wikimedia.org/T251620) (owner: Cmjohnson)
[22:23:46] <wikibugs>	 Operations, ops-codfw, DC-Ops: Decomission oresrdb2002.codfw.wmnet - https://phabricator.wikimedia.org/T254240 (wiki_willy) a:Papaul
[22:26:05] <wikibugs>	 Operations, Core Platform Team, Traffic: Move wikitech purges to kafka - https://phabricator.wikimedia.org/T254828 (Pchelolo)
[22:26:50] <wikibugs>	 (Abandoned) Urbanecm: Remove unused logos from /static/images/project-logos [mediawiki-config] - https://gerrit.wikimedia.org/r/521282 (https://phabricator.wikimedia.org/T227419) (owner: Urbanecm)
[22:27:17] <wikibugs>	 (Abandoned) Urbanecm: [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) (owner: Urbanecm)
[22:33:55] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 64 probes of 577 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:34:55] <wikibugs>	 (PS1) Cmjohnson: Adding thanos-be100[1-4] to site.pp [puppet] - https://gerrit.wikimedia.org/r/603648 (https://phabricator.wikimedia.org/T251618)
[22:36:42] <wikibugs>	 (PS2) Cmjohnson: Adding thanos-be100[1-4] to site.pp [puppet] - https://gerrit.wikimedia.org/r/603648 (https://phabricator.wikimedia.org/T251618)
[22:37:37] <wikibugs>	 (PS1) Ppchelko: [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781)
[22:38:26] <wikibugs>	 (CR) jerkins-bot: [V: -1] [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[22:41:06] <wikibugs>	 (PS2) Ppchelko: [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781)
[22:45:23] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding thanos-be100[1-4] to site.pp [puppet] - https://gerrit.wikimedia.org/r/603648 (https://phabricator.wikimedia.org/T251618) (owner: Cmjohnson)
[22:45:46] <wikibugs>	 (PS3) Ppchelko: [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781)
[22:45:49] <icinga-wm>	 PROBLEM - PHP opcache health on wtp2003 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:46:38] <wikibugs>	 (CR) jerkins-bot: [V: -1] [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[22:47:22] <wikibugs>	 (PS1) BryanDavis: Pywikibot container [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/603652 (https://phabricator.wikimedia.org/T249787)
[22:48:47] <wikibugs>	 (PS4) Ppchelko: [No-op]: Add precautions for kafka-purges before transition [mediawiki-config] - https://gerrit.wikimedia.org/r/603649 (https://phabricator.wikimedia.org/T250781)
[22:49:09] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade
[22:49:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:24] <wikibugs>	 (CR) BryanDavis: "I'm not sure this is best way to approach the problem, but I thought I would at least get my work out of a local directory and into gerrit" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/603652 (https://phabricator.wikimedia.org/T249787) (owner: BryanDavis)
[22:52:48] <wikibugs>	 (PS1) Ppchelko: Enable kafka purges everywhere. [mediawiki-config] - https://gerrit.wikimedia.org/r/603654 (https://phabricator.wikimedia.org/T250781)
[22:52:50] <wikibugs>	 (PS1) Ppchelko: Disbalse HTCP purges where kafka purges are enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/603655 (https://phabricator.wikimedia.org/T250781)
[22:53:09] <shdubsh>	 !log update mtail to 3.0.0~rc35 on mw and wtp hosts codfw
[22:53:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:13] <wikibugs>	 (PS2) Ppchelko: Disable HTCP purges where kafka purges are enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/603655 (https://phabricator.wikimedia.org/T250781)
[22:53:18] <logmsgbot>	 !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
[22:53:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:38] <wikibugs>	 (PS1) Cmjohnson: Add thanos-be100[1234] to dhcpd file [puppet] - https://gerrit.wikimedia.org/r/603656 (https://phabricator.wikimedia.org/T251618)
[22:54:20] <wikibugs>	 (CR) Cmjohnson: [C: +2] Add thanos-be100[1234] to dhcpd file [puppet] - https://gerrit.wikimedia.org/r/603656 (https://phabricator.wikimedia.org/T251618) (owner: Cmjohnson)
[22:58:16] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade
[22:58:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200608T2300).
[23:02:22] <logmsgbot>	 !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
[23:02:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:59] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 577 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:09:12] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Patch-For-Review: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-fe1003.eqiad.wmne...
[23:11:39] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1001.eqiad.wmnet ` The log can be found in `/var...
[23:13:54] <wikibugs>	 (PS3) Krinkle: logging: Combine the three custom Monolog processors [mediawiki-config] - https://gerrit.wikimedia.org/r/601813
[23:14:05] <wikibugs>	 (PS2) Krinkle: logging: Omit 'unique_id' from WebProcessor mixin [mediawiki-config] - https://gerrit.wikimedia.org/r/601814 (https://phabricator.wikimedia.org/T253677)
[23:14:06] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1002.eqiad.wmnet ` The log can be found in `/var...
[23:14:15] <wikibugs>	 (CR) Krinkle: [C: +2] logging: Combine the three custom Monolog processors [mediawiki-config] - https://gerrit.wikimedia.org/r/601813 (owner: Krinkle)
[23:15:28] <wikibugs>	 (Merged) jenkins-bot: logging: Combine the three custom Monolog processors [mediawiki-config] - https://gerrit.wikimedia.org/r/601813 (owner: Krinkle)
[23:15:29] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1003.eqiad.wmnet ` The log can be found in `/var...
[23:20:15] <icinga-wm>	 RECOVERY - PHP opcache health on wtp2003 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[23:21:21] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1001.eqiad.wmnet ` The log can be found in `/var...
[23:21:36] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1002.eqiad.wmnet ` The log can be found in `/var...
[23:21:44] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1003.eqiad.wmnet ` The log can be found in `/var...
[23:22:57] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] - https://phabricator.wikimedia.org/T251618 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` thanos-be1004.eqiad.wmnet ` The log can be found in `/var...
[23:23:07] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime
[23:23:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:33] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[23:25:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:01] * Krinkle testing on mwdebug1002
[23:29:49] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-fe1003.eqiad.wmnet'] `  and were **ALL** successful.
[23:32:24] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (Cmjohnson)
[23:32:43] <foks>	 !log removing one file for legal compliance
[23:32:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:14] <wikibugs>	 Operations, ops-eqiad, DC-Ops: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (Cmjohnson) thanos-fe1003 is the only one installed at the moment.    thanos-fe1001 mgmt is not working, - need to check cable thanos-fe1002 does not appea...
[23:33:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:34:30] <icinga-wm>	 PROBLEM - PHP opcache health on mw2193 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[23:35:45] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/logging.php: I8c22a1a8fc402 (duration: 00m 58s)
[23:35:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:54] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:36:21] <wikibugs>	 (CR) Legoktm: Add html web image (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/601839 (https://phabricator.wikimedia.org/T241817) (owner: Legoktm)
[23:37:33] <wikibugs>	 (PS2) Legoktm: Add html web image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/601839 (https://phabricator.wikimedia.org/T241817)
[23:38:41] <wikibugs>	 (CR) Krinkle: [C: +2] logging: Omit 'unique_id' from WebProcessor mixin [mediawiki-config] - https://gerrit.wikimedia.org/r/601814 (https://phabricator.wikimedia.org/T253677) (owner: Krinkle)
[23:38:45] <wikibugs>	 (PS1) Legoktm: Drop fam from everywhere [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/603667
[23:39:31] <wikibugs>	 (Merged) jenkins-bot: logging: Omit 'unique_id' from WebProcessor mixin [mediawiki-config] - https://gerrit.wikimedia.org/r/601814 (https://phabricator.wikimedia.org/T253677) (owner: Krinkle)
[23:39:48] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:40:47] <wikibugs>	 (PS1) BryanDavis: Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640)
[23:41:24] <wikibugs>	 (CR) Legoktm: "I didn't see this before I pushed Change-Id: Ibc99d13d63340cde3c5fdcd3c3c5a7a9255b3d76, but that already has the drop from build.py part i" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/595966 (owner: BryanDavis)
[23:41:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640) (owner: BryanDavis)
[23:42:29] <wikibugs>	 (Abandoned) BryanDavis: Remove unused static-web container [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/595966 (owner: BryanDavis)
[23:42:30] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:43:00] <wikibugs>	 (CR) BryanDavis: [C: +1] Drop unused static-web-sssd image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/601838 (owner: Legoktm)
[23:45:08] <icinga-wm>	 RECOVERY - PHP opcache health on mw2193 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[23:46:17] <wikibugs>	 (PS1) Cwhite: hiera: add disable_fsnotify flag for mtail in codfw [puppet] - https://gerrit.wikimedia.org/r/603673 (https://phabricator.wikimedia.org/T251466)
[23:47:55] <wikibugs>	 (CR) Krinkle: [C: +2] "Confirmed via mwdebug1002 that two logically identical message documents look the same before/after, except without the confusiong unique_" [mediawiki-config] - https://gerrit.wikimedia.org/r/601814 (https://phabricator.wikimedia.org/T253677) (owner: Krinkle)
[23:48:17] <wikibugs>	 (CR) Cwhite: [C: +2] hiera: add disable_fsnotify flag for mtail in codfw [puppet] - https://gerrit.wikimedia.org/r/603673 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[23:49:12] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/logging.php: If991929c84ff69 (duration: 00m 57s)
[23:49:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:41] <wikibugs>	 (PS2) BryanDavis: Disable tool name alias in lighttpd config with --canonical [software/tools-webservice] - https://gerrit.wikimedia.org/r/603668 (https://phabricator.wikimedia.org/T254640)
[23:52:08] <wikibugs>	 (CR) BryanDavis: [C: +2] Drop unused static-web-sssd image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/601838 (owner: Legoktm)
[23:52:39] <wikibugs>	 (Merged) jenkins-bot: Drop unused static-web-sssd image [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/601838 (owner: Legoktm)
[23:58:47] <icinga-wm>	 PROBLEM - PHP opcache health on mw2197 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health