[00:00:04] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T0000). Please do the needful.
[00:19:36] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10Varnent) @jbond - correct - that is the mailing list that we are referring to. :)
[00:23:55] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 27.76 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:24:45] 10Operations, 10Cassandra, 10Dependency-Tracking, 10Wikibase-Quality, and 7 others: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache - https://phabricator.wikimedia.org/T204024 (10Addshore) 05Open→03Stalled Stalled on the RFC
[00:25:23] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 75.95 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:40:53] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:40:53] PROBLEM - wiki content on commons on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/1118/
[00:40:53] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[00:40:53] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] hmm
[00:40:53] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[00:40:53] PROBLEM - https://phabricator.wikimedia.org on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
[00:40:53] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[00:40:53] phan and grafana are not loading for me
[00:40:53] *phab
[00:40:53] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 58.92 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:40:53] wikipedia is not loading either or is very slow
[00:40:53] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/71/
[00:40:53] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:40:53] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:40:53] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[00:40:53] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:40:53] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[00:47:21] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80%
[00:59:04] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 2.420 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[00:59:31] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:59:35] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[01:00:49] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:00:53] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:00:59] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[01:01:20] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:01:45] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[01:01:51] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:01:55] PROBLEM - https://phabricator.wikimedia.org on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
[01:01:57] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:01:57] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:01:57] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:02:03] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:02:14] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:03:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 55.81 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:03:12] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[01:03:12] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[01:03:19] RECOVERY - https://phabricator.wikimedia.org on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 36523 bytes in 0.339 second response time https://wikitech.wikimedia.org/wiki/Phabricator
[01:03:20] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:03:20] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:03:23] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:03:23] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:03:23] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:03:31] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:06:09] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 100.8 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:06:19] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:07:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:07:33] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[01:07:38] PROBLEM - wiki content on commons on commons.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection. https://phabricator.wikimedia.org/project/view/1118/
[01:07:43] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:07:45] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:07:51] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[01:07:56] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:07:57] PROBLEM - https://phabricator.wikimedia.org on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
[01:07:59] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:07:59] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:07:59] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:03] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[01:08:05] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:06] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:06] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:19] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:19] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:08:41] wtf
[01:08:58] Prod is unwell.
[01:08:59] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[01:09:09] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:09:09] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:09:10] RECOVERY - wiki content on commons on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 162825 bytes in 1.035 second response time https://phabricator.wikimedia.org/project/view/1118/
[01:09:49] In case this is news to folks, might be worth noting that this does not seem to be wikimedia-specific. A LOT of things have gone down.
[01:09:53] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[01:09:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:09:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:09:58] PROBLEM - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:09:58] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:10:16] I'm just going to assume sharks ate the internet tubes and we're all doomed.
[01:10:47] network issues i think Isarra
[01:10:47] (03PS1) 10BBlack: emergency depool eqiad front edge [dns] - 10https://gerrit.wikimedia.org/r/519319
[01:10:47] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[01:10:49] RECOVERY - https://phabricator.wikimedia.org on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 36523 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Phabricator
[01:10:53] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:10:53] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:10:55] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:10:55] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:11:01] yes there's an ongoing issue being addressed, SRE is working on it
[01:11:21] (03CR) 10BBlack: [C: 03+2] emergency depool eqiad front edge [dns] - 10https://gerrit.wikimedia.org/r/519319 (owner: 10BBlack)
[01:11:37] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:11:54] !log depool eqiad front edge
[01:12:02] !log depool eqiad front edge (in DNS, I meant)
[01:12:39] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[01:12:54] RECOVERY - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 3.327 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:13:05] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid
[01:13:06] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:06] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:07] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:07] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:09] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:09] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:13:09] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:13:12] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:13:13] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[01:13:14] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80%
[01:13:15] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[01:13:17] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:20] PROBLEM - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:13:23] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[01:13:25] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:55] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:05] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:14:06] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:14:06] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:14:38] I'm getting timeouts and a weird ERR_SSL_VERSION_INTERFERENCE error trying to load logged in and anonymous pages. Works fine on my phone but not on a couple computers over here. Same issue on Firefox, Chrome, and Chromium.
[01:14:46] /mode #wikimedia-operations +o andrewbogott
[01:14:50] bah!
[01:14:57] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:14:57] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:14:57] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:15:05] Other websites seem to be fine.
[01:15:05] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:15:08] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:15:11] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[01:15:12] PROBLEM - Host text-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[01:15:15] niedzielski it's known, SRE are looking.
[01:15:43] andrewbogott: `/cs op #wikimedia-operations andrewbogott`. :-)
[01:15:51] But also sharks.
[01:16:03] :)
[01:16:05] thanks James_F
[01:16:14] thanks paladox !
[01:16:40] Thanks, andrewbogott.
[01:16:57] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:17:02] RECOVERY - Host text-lb.eqiad.wikimedia.org is UP: PING WARNING - Packet loss = 37%, RTA = 10.08 ms [01:17:20] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [01:17:23] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:23] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:26] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:17:27] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:17:27] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:17:29] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:35] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [01:17:35] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:35] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:35] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:35] RECOVERY - 
restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:36] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:37] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:39] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:17:39] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:17:40] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15746 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:17:43] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [01:17:53] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:17:55] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:59] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:18:11] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [01:19:21] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:19:21] RECOVERY - graphoid endpoints health 
on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:19:29] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:19:33] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:20:46] RECOVERY - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:20:59] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:21:16] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 36.14 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:21:30] PROBLEM - wiki content on commons on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/1118/ [01:22:20] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:22:36] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15758 bytes in 1.101 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:22:52] RECOVERY - wiki content on commons on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 162877 bytes in 0.271 second response time https://phabricator.wikimedia.org/project/view/1118/ [01:24:33] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:25:10] PROBLEM - LVS HTTP IPv4 on 
text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:25:14] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15745 bytes in 0.746 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:25:19] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw1257.eqiad.wmnet, mw1246.eqiad.wmnet, mw1238.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:25:30] surprisingly gerrit is the only service working for me. [01:26:57] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:27:35] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:28:31] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1242.eqiad.wmnet, mw1238.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [01:29:52] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:30:09] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [01:31:35] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/RESTBase [01:31:42] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 54%, RTA = 84.42 ms [01:32:04] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 58%, RTA = 84.62 ms [01:32:28] RECOVERY - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:33:51] PROBLEM - pybal on lvs1013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [01:34:01] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [01:34:13] PROBLEM - PyBal backends health check on lvs1013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [01:34:37] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [01:34:54] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ed1a::1) [01:35:19] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 2.321e+05 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:35:46] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:37:13] PROBLEM - PyBal connections to etcd on lvs1013 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [01:38:48] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15746 bytes in 2.788 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:38:48] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - 
CRITICAL - appservers-https_443: Servers mw1249.eqiad.wmnet, mw1254.eqiad.wmnet, mw1248.eqiad.wmnet, mw1252.eqiad.wmnet, mw1255.eqiad.wmnet, mw1246.eqiad.wmnet are marked down but pooled: apaches_80: Servers mw1241.eqiad.wmnet, mw1243.eqiad.wmnet, mw1256.eqiad.wmnet, mw1249.eqiad.wmnet, mw1246.eqiad.wmnet, mw1258.eqiad.wmnet, mw1239.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:39:02] PROBLEM - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:40:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:40:22] RECOVERY - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:41:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:41:42] (03PS1) 10BBlack: emergency depool esams front edge [dns] - 10https://gerrit.wikimedia.org/r/519320 [01:43:14] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% [01:43:15] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:43:30] (03CR) 10BBlack: [C: 03+2] emergency depool esams front edge [dns] - 10https://gerrit.wikimedia.org/r/519320 (owner: 10BBlack) [01:43:45] !log depool esams front edge [01:47:20] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [01:47:39] PROBLEM - pybal on lvs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [01:47:41] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [01:52:31] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [01:54:41] PROBLEM - pybal on lvs1013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [01:59:51] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 70.71 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:01:21] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 27 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:04:17] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 
is CRITICAL: 45.9 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:04:53] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 0.04016 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:05:44] PROBLEM - kvm ssl cert on cloudvirt1024 is CRITICAL: Certificate will expire https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:05:49] PROBLEM - kvm ssl cert on cloudvirt1018 is CRITICAL: Certificate will expire https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:07:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:07:15] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 79.08 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:07:19] (03CR) 10Catrope: [C: 03+1] Betalabs: Enable GrowthExperiments features for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) (owner: 10Kosta Harlan) [02:08:20] ACKNOWLEDGEMENT - kvm ssl cert on cloudvirt1018 is CRITICAL: Certificate will expire andrew bogott non-urgent but discussed in T225484 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:08:21] ACKNOWLEDGEMENT - kvm ssl cert on cloudvirt1024 is CRITICAL: Certificate will expire andrew bogott non-urgent but discussed in T225484 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:09:21] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 169.7 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:11:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text 
site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:12:29] !log lvs3001: powercycle, unresponsive console [02:13:13] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% [02:15:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:15:56] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 80%, RTA = 84.36 ms [02:15:57] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:16:07] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 50%, RTA = 84.32 ms [02:17:14] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [02:17:14] !log lvs3003: powercycle, unresponsive console [02:18:54] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.33 ms [02:19:08] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 0.519 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:19:31] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:20:09] RECOVERY - Host lvs3003 is UP: PING OK - Packet loss = 0%, RTA = 83.44 ms [02:24:15] RECOVERY - pybal on lvs1013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [02:24:27] RECOVERY - pybal on lvs1016 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [02:24:31] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [02:24:35] RECOVERY - PyBal backends health check on lvs1013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:25:19] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 67 connections established with conf1004.eqiad.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [02:26:25] RECOVERY - PyBal connections to etcd on lvs1013 is OK: OK: 4 connections established with conf1004.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [02:26:35] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 4 probes of 471 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:28:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:32:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:39:45] (03PS1) 10BBlack: Revert "emergency depool esams front edge" [dns] - 10https://gerrit.wikimedia.org/r/519325 [02:39:48] (03PS1) 10BBlack: Revert "emergency depool eqiad front edge" [dns] - 10https://gerrit.wikimedia.org/r/519326 [02:40:27] !log re-pooling esams+eqiad front edge traffic [02:40:34] (03CR) 10BBlack: [C: 03+2] Revert "emergency depool esams front edge" [dns] - 10https://gerrit.wikimedia.org/r/519325 (owner: 10BBlack) [02:40:40] (03CR) 10BBlack: [C: 03+2] Revert "emergency depool eqiad front edge" [dns] - 10https://gerrit.wikimedia.org/r/519326 (owner: 10BBlack) [02:47:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:47:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:48:15] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:48:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:48:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:48:29] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:48:47] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [02:49:01] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:49:26] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:49:26] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:49:41] PROBLEM - HTTP availability for Nginx -SSL 
terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:49:51] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [02:50:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [02:51:07] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [02:51:09] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [02:51:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [02:52:01] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 49.36 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:52:14] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is 
working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [02:52:25] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:53:27] PROBLEM - https://phabricator.wikimedia.org on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator [02:53:29] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [02:53:36] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:53:49] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:54:47] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:54:49] RECOVERY - https://phabricator.wikimedia.org on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 36522 bytes in 1.484 second response time https://wikitech.wikimedia.org/wiki/Phabricator [02:54:51] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for 
April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedi [02:54:51] se [02:55:09] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:55:27] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:55:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:56:30] PROBLEM - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:56:33] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:35] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:56:38] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: 
HTTP/1.1 200 OK - 15809 bytes in 8.992 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:56:43] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:59] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:59] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:59] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:57:03] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:57:16] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 107 probes of 471 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:57:43] RECOVERY 
- restbase endpoints health on restbase1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:57:45] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:57:57] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [02:57:59] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:05] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:06] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:11] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [02:58:11] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:58:13] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:58:21] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:21] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:21] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:29] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:58:31] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:58:31] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:58:45] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 47.92 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:59:01] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 56 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:59:13] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [02:59:43] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:00:32] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2804 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:00:41] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 101 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:00:56] RECOVERY - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15848 bytes in 4.393 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:01:10] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:01:45] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 82.39 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:02:13] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [03:02:27] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2846 bytes in 0.455 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:02:34] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15811 bytes in 0.517 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:03:13] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% [03:03:36] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15849 bytes in 0.471 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:05:32] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15862 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:07:13] 04Critical Alert for device cr2-eqsin.wikimedia.org - Primary inbound port utilisation over 80% [03:08:21] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound 
port utilisation over 80% [03:09:59] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:10:05] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:10:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:10:41] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:10:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:10:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:10:53] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:11:21] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:11:29] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:11:33] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:11:37] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 21 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:11:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:15:01] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [03:15:37] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [03:16:21] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:16:29] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [03:16:41] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [03:16:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [03:17:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [03:17:41] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28236 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:18:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:25:05] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:26:13] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 74.07 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:28:01] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:28:42] (03PS1) 10BBlack: depool eqsin and eqiad front edges [dns] - 10https://gerrit.wikimedia.org/r/519334 [03:29:12] !log depooling eqsin + eqiad edges [03:29:31] (03CR) 10BBlack: [C: 03+2] depool eqsin and eqiad front edges [dns] - 10https://gerrit.wikimedia.org/r/519334 (owner: 10BBlack) [03:33:14] 
04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% [03:37:13] 04Critical Alert for device cr2-eqsin.wikimedia.org - Primary inbound port utilisation over 80% [03:38:20] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [03:38:59] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 42.82 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:40:15] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 31.82 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:40:46] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 117 probes of 471 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:47:25] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 59 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:52:20] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% [03:52:53] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 19 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:56:45] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [04:02:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 80% [04:02:37] RECOVERY - IPv4 ping to eqiad on 
ripe-atlas-eqiad is OK: OK - failed 3 probes of 471 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [04:03:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% [04:15:25] !log lvs1013: enable puppet + pybal [04:15:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:16:52] !log lvs1016: enable puppet + pybal [04:18:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [04:20:07] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 56.75 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:20:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:26:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [04:27:19] (03PS1) 10BBlack: Revert "depool eqsin and eqiad front edges" [dns] - 10https://gerrit.wikimedia.org/r/519342 [04:28:56] (03CR) 10BBlack: [C: 03+2] Revert "depool eqsin and eqiad front edges" [dns] - 10https://gerrit.wikimedia.org/r/519342 (owner: 10BBlack) [04:30:04] !log re-pooling eqsin+eqiad front edges [04:31:53] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 84.31 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:32:07] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 95.25 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:34:14] !log Start replication on labsdb1011 - T222978 [04:34:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:34:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:34:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:35:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: 
cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:36:05] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [04:36:15] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [04:36:19] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:36:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [04:36:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [04:37:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:37:47] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:38:18] !log Stop MySQL on dbstore1005 for upgrade - T226358 [04:38:33] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 55.21 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:38:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:38:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:39:01] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [04:39:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:39:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:40:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:40:35] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 53.45 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:40:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:40:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:40:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:40:41] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:40:51] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:41:05] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of 
data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [04:41:21] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:41:33] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:41:33] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:43:40] 10Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (10A2093064) [04:44:16] 10Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (10Marostegui) We are looking into general connectivity issues at the moment [04:44:23] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.3036 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [04:45:54] <[1997kB]> cp1087, Varnish XID 308445704 [04:45:55] <[1997kB]> Error: 503, Backend fetch failed at Thu, 27 Jun 2019 04:44:55 GMT [04:47:19] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [04:47:37] [1997kB]: we are looking into some connectivity issues at the moment [04:49:12] <[1997kB]> alright. ty [04:49:27] !log restarting varnish-backend on cp1087 (seems unhealthy!) [04:51:49] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:52:17] 10Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (10Marostegui) @BBlack restarted varnish on that host. It should be ok now. [04:52:32] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2743 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:52:52] PROBLEM - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2829 bytes in 1.399 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:53:00] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2822 bytes in 0.457 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:53:06] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:53:26] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2846 bytes in 0.454 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:54:02] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15757 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:54:25] some problem on Commons? 
[04:54:47] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:54:50] Request from 106.213.175.182 via cp1077 cp1077, Varnish XID 532578964 [04:54:50] Error: 503, Backend fetch failed [04:54:53] 10Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (10Pruem) cp1077 is also producing this right now. [04:55:04] yannf: we are having connectivity issues, we are on it [04:55:05] There are general connectivity issues for all sites right now. [04:55:06] PROBLEM - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2821 bytes in 1.402 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:55:07] yannf: i am seeing it in meta too [04:55:15] marostegui, ok thanks [04:55:31] 10Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (10Marostegui) We are having general connectivity issues [04:56:24] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15858 bytes in 0.462 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:56:36] RECOVERY - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15848 bytes in 1.268 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:57:24] RECOVERY - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15861 bytes in 1.256 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:57:31] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15848 bytes in 0.457 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:00:26] 10Operations: HTTP 503 on zh.wikipedia.org - 
https://phabricator.wikimedia.org/T226685 (10Antigng) It seems that only requests coming through Varnish frontends at eqiad are affected. [05:01:31] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:02:01] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:02:13] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:02:13] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:02:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:02:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:02:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:02:49] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:03:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:03:55] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:07:01] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [05:07:09] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [05:07:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [05:07:55] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [05:09:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [05:09:56] RECOVERY - Eqsin HTTP 5xx 
reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [05:11:33] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 71.91 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:13:56] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.99 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:15:38] (03PS1) 10Marostegui: install_server: Do not re-image db1133 and dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/519343 [05:17:57] (03CR) 10Marostegui: [C: 03+2] install_server: Do not re-image db1133 and dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/519343 (owner: 10Marostegui) [05:26:03] 10Operations, 10ops-eqiad, 10decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (10Marostegui) [05:27:00] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [05:27:14] !log Remove db1068 from tendril and zarcillo - T226689 [05:28:48] (03PS1) 10Marostegui: mariadb: Decommission db1068 [puppet] - 10https://gerrit.wikimedia.org/r/519344 [05:29:36] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Decommission db1068 [puppet] - 10https://gerrit.wikimedia.org/r/519344 (owner: 10Marostegui) [05:31:10] (03PS2) 10Marostegui: mariadb: Decommission db1068 [puppet] - 10https://gerrit.wikimedia.org/r/519344 [05:33:35] 10Operations, 10ops-eqiad, 10decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (10Marostegui) p:05Triage→03Normal [05:40:29] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:40:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:41:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:41:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:41:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:41:01] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:41:15] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:41:43] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:41:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:41:55] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:42:13] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[05:42:51] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:43:09] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:43:41] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[05:43:51] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[05:44:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:47:39] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:47:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:47:51] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:47:51] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:48:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:48:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:48:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:48:26] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:48:26] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:48:39] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:52:01] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:52:41] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[05:52:59] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:54:01] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[05:54:01] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[05:54:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
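The "HTTP 5xx reqs/min" checks above alert on the share of recent Graphite datapoints above a fixed threshold (e.g. "55.56% of data above the critical threshold [1000.0]", recovering once less than 1% exceeds 250). A minimal sketch of that percentage calculation, using a hypothetical `percent_above` helper and made-up sample data rather than the production check_graphite code:

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above `threshold`."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)

# A window of 5xx-per-minute samples (None = missing datapoint, ignored):
series = [1200, 900, 1450, None, 1100, 300, 1800, 950, 1600]
pct = percent_above(series, 1000.0)           # 5 of 8 non-null samples -> 62.5
status = "CRITICAL" if pct >= 30.0 else "OK"  # illustrative 30% cutoff
```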
[05:59:53] (PS4) Gilles: Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661)
[06:00:12] (CR) Gilles: "Added a flag to toggle the feature on/off" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[06:00:30] (CR) jerkins-bot: [V: -1] Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[06:02:26] (PS5) Gilles: Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661)
[06:21:24] (PS6) Elukey: analytics::refinery::job::data_purge add deletion for data_quality_hourly [puppet] - https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: Mforns)
[06:27:53] (CR) Elukey: [C: +2] analytics::refinery::job::data_purge add deletion for data_quality_hourly [puppet] - https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: Mforns)
[06:30:29] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:31:01] PROBLEM - puppet last run on db2108 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:31:57] PROBLEM - puppet last run on debmonitor2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:33:47] !log restart rsyslog on wezen - T199406
[06:35:21] !log restart rsyslog on lithium - T199406
[06:35:27] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[06:35:35] both were stuck
[06:36:29] RECOVERY - puppet last run on db2108 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:38:01] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 849 days) https://wikitech.wikimedia.org/wiki/Logs
[06:43:11] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:17] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:56:10] RECOVERY - puppet last run on mc1035 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:58:10] RECOVERY - puppet last run on debmonitor2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:53] (PS1) Muehlenhoff: Mask the default uwsgi service on puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/519350
[07:03:20] (CR) jerkins-bot: [V: -1] Mask the default uwsgi service on puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/519350 (owner: Muehlenhoff)
[07:05:42] (CR) Volans: "recheck" [puppet] - https://gerrit.wikimedia.org/r/519350 (owner: Muehlenhoff)
[07:08:14] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/519350 (owner: Muehlenhoff)
[07:11:38] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:54] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:01] (CR) Marostegui: [C: +2] mariadb: Decommission db1068 [puppet] - https://gerrit.wikimedia.org/r/519344 (owner: Marostegui)
[07:24:08] (PS3) Marostegui: mariadb: Decommission db1068 [puppet] - https://gerrit.wikimedia.org/r/519344
[07:26:55] !log Stop MySQL on db1068 for decommission - T226689
[07:28:55] Operations, ops-eqiad, decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (Marostegui)
[07:29:38] Operations, ops-eqiad, DC-Ops, decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (Marostegui) a:Marostegui→RobH This host is ready for DCOPs to take over.
[07:29:56] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[07:38:11] (PS2) Muehlenhoff: Mask the default uwsgi service on puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/519350
[07:42:22] (CR) Muehlenhoff: [C: +2] Mask the default uwsgi service on puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/519350 (owner: Muehlenhoff)
[07:42:46] (PS2) Gehel: wdqs: publish full MDC in file based logs. [puppet] - https://gerrit.wikimedia.org/r/519046
[07:43:20] (CR) Gehel: [C: +2] wdqs: publish full MDC in file based logs.
[puppet] - https://gerrit.wikimedia.org/r/519046 (owner: Gehel)
[07:45:54] RECOVERY - DPKG on puppetboard2001 is OK: All packages OK
[07:47:18] RECOVERY - Check systemd state on puppetboard2001 is OK: OK - running: The system is fully operational
[07:47:38] RECOVERY - puppet last run on puppetboard2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:50:29] (PS1) Elukey: cdh::oozie: add hive2/hcat credentials classes [puppet] - https://gerrit.wikimedia.org/r/519355 (https://phabricator.wikimedia.org/T212259)
[07:50:41] thanks for fixing it moritzm!
[07:51:16] (CR) Elukey: [C: +2] cdh::oozie: add hive2/hcat credentials classes [puppet] - https://gerrit.wikimedia.org/r/519355 (https://phabricator.wikimedia.org/T212259) (owner: Elukey)
[07:51:55] (PS1) Gehel: wdqs: fix pattern in log configuration [puppet] - https://gerrit.wikimedia.org/r/519356
[07:52:15] (PS2) Gehel: wdqs: fix pattern in log configuration [puppet] - https://gerrit.wikimedia.org/r/519356
[07:52:48] Operations, Community-Relations, Traffic, Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Zache) Reported again in [[ https://fi.wikipedia.org/wiki/W...
[07:52:53] (CR) Gehel: [C: +2] wdqs: fix pattern in log configuration [puppet] - https://gerrit.wikimedia.org/r/519356 (owner: Gehel)
[07:58:24] Operations, Community-Relations, Traffic, Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (ArielGlenn) Can we get approximate times for these last rep...
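The rsyslog TLS listener recovery earlier in the log reports the certificate lifetime as "valid until 2021-10-23 19:09:29 +0000 (expires in 849 days)". That countdown is plain date arithmetic; a standalone sketch with values copied from the log line and its 06:38:01 UTC timestamp (an illustration, not the actual Icinga SSL check, which reads the certificate off the socket):

```python
from datetime import datetime, timezone

def days_until_expiry(not_after, now):
    """Whole days remaining before a certificate's notAfter timestamp."""
    return (not_after - now).days

# Values taken from the recovery message and the log timestamp (2019-06-27):
not_after = datetime(2021, 10, 23, 19, 9, 29, tzinfo=timezone.utc)
checked_at = datetime(2019, 6, 27, 6, 38, 1, tzinfo=timezone.utc)
remaining = days_until_expiry(not_after, checked_at)  # 849, matching the alert text
```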
[08:14:30] Operations, Community-Relations, Traffic, Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Ejs-80) @ArielGlenn, two fiwiki users reported about this b...
[08:14:36] just passing by to wish you guys a great day.
[08:20:30] Operations, Graphite: uwsgi-graphite-web.service not functional after reboots of Graphite hosts - https://phabricator.wikimedia.org/T226694 (MoritzMuehlenhoff)
[08:20:59] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (ema) Just bumped into another very frequent one: ` 08:18:54 ema@cp5005.eqsin.wmnet:~...
[08:21:21] (PS4) Arturo Borrero Gonzalez: toolforge: configure kubernetes node using TLS instead of token auth [puppet] - https://gerrit.wikimedia.org/r/519259 (https://phabricator.wikimedia.org/T215531) (owner: Bstorm)
[08:21:43] Operations, Performance-Team, Traffic: Send peering requests to AS with the worst TTFB - https://phabricator.wikimedia.org/T219486 (Gilles) a:Gilles→ayounsi
[08:26:18] (PS4) Gehel: icinga: fix zero division error for mjolnir bulk update alert [puppet] - https://gerrit.wikimedia.org/r/519201 (https://phabricator.wikimedia.org/T225904) (owner: Mathew.onipe)
[08:27:49] (CR) Gehel: [C: +2] icinga: fix zero division error for mjolnir bulk update alert [puppet] - https://gerrit.wikimedia.org/r/519201 (https://phabricator.wikimedia.org/T225904) (owner: Mathew.onipe)
[08:28:08] onimisionipe: ^ I'll let you check
[08:31:27] (PS5) Arturo Borrero Gonzalez: toolforge: configure kubernetes node using TLS instead of token auth [puppet] - https://gerrit.wikimedia.org/r/519259 (https://phabricator.wikimedia.org/T215531) (owner: Bstorm)
[08:33:00] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: configure kubernetes node using TLS instead of token auth [puppet] - https://gerrit.wikimedia.org/r/519259 (https://phabricator.wikimedia.org/T215531) (owner: Bstorm)
[08:44:02] Operations, Community-Relations, Traffic, Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (ArielGlenn) The 03-03:35 incidents were likely related to n...
[08:49:29] (PS1) Elukey: profile::analytics::refinery::job::data_purge: avoid saturday [puppet] - https://gerrit.wikimedia.org/r/519361 (https://phabricator.wikimedia.org/T226035)
[08:54:36] (CR) Elukey: [C: +2] profile::analytics::refinery::job::data_purge: avoid saturday [puppet] - https://gerrit.wikimedia.org/r/519361 (https://phabricator.wikimedia.org/T226035) (owner: Elukey)
[08:56:48] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (Gilles) I've seen this one stuck in poolcounter throttling for a while, it's definite...
[08:58:45] (PS7) Mvolz: Enable reftabs on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197)
[09:06:00] (PS1) Hashar: Add git buildpackage configuration [debs/file-read-backwards] (debian) - https://gerrit.wikimedia.org/r/519363
[09:08:57] (PS1) Gilles: Renew origin trial tokens [mediawiki-config] - https://gerrit.wikimedia.org/r/519364
[09:10:25] (CR) Hashar: "The CI job uses git buildpackage under the hood, seems that made the build work!" [debs/file-read-backwards] (debian) - https://gerrit.wikimedia.org/r/519363 (owner: Hashar)
[09:11:57] (CR) Gilles: [C: +2] Renew origin trial tokens [mediawiki-config] - https://gerrit.wikimedia.org/r/519364 (owner: Gilles)
[09:13:09] (Merged) jenkins-bot: Renew origin trial tokens [mediawiki-config] - https://gerrit.wikimedia.org/r/519364 (owner: Gilles)
[09:13:24] (CR) jenkins-bot: Renew origin trial tokens [mediawiki-config] - https://gerrit.wikimedia.org/r/519364 (owner: Gilles)
[09:17:27] !log rollback AMS-IX special routing
[09:19:49] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 27081 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[09:20:58] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Renew origin trial tokens (duration: 00m 59s)
[09:25:36] elastic1025 is moving a commonswiki shard away, it should recover in a few
[09:28:51] Operations, observability, Graphite: uwsgi-graphite-web.service not functional after reboots of Graphite hosts - https://phabricator.wikimedia.org/T226694 (Peachey88)
[09:32:38] (PS1) Elukey: Add more granularity to query/time|size buckets [software/druid_exporter] - https://gerrit.wikimedia.org/r/519365 (https://phabricator.wikimedia.org/T226035)
[09:34:17] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[09:35:27] Operations, observability, Goal, User-fgiunchedi: TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220104 (fgiunchedi)
[09:41:13] Operations, Wikimedia-General-or-Unknown: Request for information about hosting services for WM-ES - https://phabricator.wikimedia.org/T211414 (jcrespo) Dzhan and others answered at T211414#4822356 T211414#4805585, is that enough/does that answer your questions? I suggest if you need further information...
[09:44:45] Operations, Traffic: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (akosiaris) Open→Stalled p:Normal→Low OK, good to know. Moving to Low priority and Stalled status until then.
[09:47:11] Warning Alert for device cr2-esams.wikimedia.org - Memory over 85%
[09:47:24] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Joe)
[09:50:03] will look into ^
[09:50:41] nothing urgent, it's at 86%, slowly raising since a very long time ago
[10:02:18] Operations, Analytics, SRE-Access-Requests: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (jbond) @Nuria I have checked with moritz and cn=wmf should be all that is required for access to turnilo. @ACraze I have checked the logs on...
[10:02:27] PROBLEM - HHVM rendering on mw1320 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:02:31] PROBLEM - Nginx local proxy to apache on mw1320 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:03:55] RECOVERY - HHVM rendering on mw1320 is OK: HTTP OK: HTTP/1.1 200 OK - 74935 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:03:59] RECOVERY - Nginx local proxy to apache on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:04:35] (PS1) Revi: Add Portal Namespace to VisualEditor option on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/519367 (https://phabricator.wikimedia.org/T224813)
[10:05:32] (CR) Effie Mouzeli: [C: +2] Have the Swift rewrite proxy renew expiry headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[10:06:27] (PS6) Gilles: Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661)
[10:06:45] (CR) Effie Mouzeli: "used +2 instead of +1, sorry!" [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[10:07:28] (CR) Effie Mouzeli: [C: +2] Increase swift proxy connection timeout to 1s [puppet] - https://gerrit.wikimedia.org/r/518658 (https://phabricator.wikimedia.org/T226373) (owner: Gilles)
[10:07:37] (PS2) Effie Mouzeli: Increase swift proxy connection timeout to 1s [puppet] - https://gerrit.wikimedia.org/r/518658 (https://phabricator.wikimedia.org/T226373)
[10:11:39] (Abandoned) MarcoAurelio: DNM JENKINS TEST [debs/file-read-backwards] - https://gerrit.wikimedia.org/r/519209 (owner: MarcoAurelio)
[10:16:08] (CR) Effie Mouzeli: [C: +1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/17144/ms-fe1005.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[10:16:20] (CR) Effie Mouzeli: [C: +2] Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[10:16:30] (PS7) Effie Mouzeli: Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661)
[10:21:29] Operations, Traffic, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Qgil)
[10:21:33] <_joe_> gehel: I'm going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/518705
[10:22:33] Operations, Analytics, Fundraising-Backlog, LDAP-Access-Requests, Wikimedia-Fundraising: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (jbond) @spatton i dont see Camille de Nes on either [[https://office.wikimedia.org/wiki/Contact_list | their c...
[10:23:52] (PS1) Elukey: role::analytics_test_cluster::hadoop::ui: configure hive for hue [puppet] - https://gerrit.wikimedia.org/r/519368 (https://phabricator.wikimedia.org/T212259)
[10:23:56] (PS2) Giuseppe Lavagetto: lvs::configuration: use a meaningful request to monitor wdqs [puppet] - https://gerrit.wikimedia.org/r/518705
[10:24:22] (CR) Elukey: [C: +2] role::analytics_test_cluster::hadoop::ui: configure hive for hue [puppet] - https://gerrit.wikimedia.org/r/519368 (https://phabricator.wikimedia.org/T212259) (owner: Elukey)
[10:25:41] (CR) Giuseppe Lavagetto: [C: +2] "I thought this had already been merged, I'll take care of deploying it carefully." [puppet] - https://gerrit.wikimedia.org/r/518705 (owner: Giuseppe Lavagetto)
[10:25:46] (CR) Filippo Giunchedi: "> Patch Set 3:" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[10:25:54] <_joe_> sigh
[10:25:55] <_joe_> again
[10:26:03] Operations, Analytics, Fundraising-Backlog, LDAP-Access-Requests, Wikimedia-Fundraising: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (MoritzMuehlenhoff) @jbond: These staff pages are often slow get updated by T&C (or whoever keeps them updated),...
[10:26:05] (PS3) Giuseppe Lavagetto: lvs::configuration: use a meaningful request to monitor wdqs [puppet] - https://gerrit.wikimedia.org/r/518705
[10:28:33] <_joe_> !log progressively restarting pybal in codfw, eqiad to pick up the change in monitoring for wdqs
[10:28:34] stashbot is missing
[10:28:41] _joe_: will not be logged
[10:28:52] * _joe_ shrugs
[10:29:41] !log-not-log restarting tcpircbot-logmsgbot on icinga1001
[10:29:58] <_joe_> !log progressively restarting pybal in codfw, eqiad to pick up the change in monitoring for wdqs
[10:30:07] !log restarted tcpircbot-logmsgbot on icinga1001, was not !log-ing since 01:11 UTC this morning
[10:30:11] <_joe_> thanks volans
[10:30:14] <_joe_> if it works
[10:30:16] let's see if it works first
[10:30:19] <_joe_> it doesn't seem to
[10:31:22] :(
[10:32:22] last logged line is from freenode-connect Welcome to freenode
[10:34:46] Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (Elitre) Sorry for asking. Is this related to T226048?
[10:35:44] !log updated buster d-i image to release candidate 2
[10:36:36] Operations, Analytics, Fundraising-Backlog, LDAP-Access-Requests, Wikimedia-Fundraising: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (jbond) Open→Resolved a:jbond >>! In T226614#5288539, @MoritzMuehlenhoff wrote: > @jbond: These staf...
[10:36:59] I'm totally fried by the heat... the wikitech page redirected me, I restarted the wrong bot, on it
[10:38:49] Operations, Traffic, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Elitre) >>! In T226048#5279710, @Kri...
[10:39:46] !log restarted stashbot on toolforge was not !log-ing since 01:11 UTC this morning
[10:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:53] yay, it's back
[10:40:10] <_joe_> !log progressively restarting pybal in codfw, eqiad to pick up the change in monitoring for wdqs
[10:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:28] _joe_: thanks! On lunch break, but scream if you need me
[10:47:36] <_joe_> nah it's fine
[10:47:49] (PS1) Alexandros Kosiaris: Add data types to k8s module [puppet] - https://gerrit.wikimedia.org/r/519369
[10:47:51] (PS1) Alexandros Kosiaris: Use more specific data types in k8s module [puppet] - https://gerrit.wikimedia.org/r/519370
[10:48:02] !log updated buster d-i image to release candidate 2
[10:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:30] !log Rolling restart ms-fe* proxy services for T226373 and T211661
[10:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:37] T226373: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373
[10:48:37] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661
[10:54:23] (PS1) Elukey: role::analytics_test_cluster::hadoop::ui: add client config [puppet] - https://gerrit.wikimedia.org/r/519373 (https://phabricator.wikimedia.org/T226698)
[10:57:24] Operations, SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (jbond) @Jpita I can see your account in OID's LDAP ` uid: jpita-ctr mail: jpita-ctr@wikimedia.org ` however the jpita developer account i see is registered to a gmail address. You wil...
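The `!log` / "Logged the message at ... Server_Admin_Log" pairs above come from a bot that watches the channel for messages starting with `!log` and records them in the SAL. A toy sketch of just the matching step, with a hypothetical `extract_sal_entry` helper (the real stashbot does considerably more, including the task-title lookups seen above):

```python
import re

# Hypothetical pattern: "!log" followed by the message to record.
LOG_RE = re.compile(r"^!log\s+(?P<message>.+)$")

def extract_sal_entry(nick, line):
    """Return a SAL-style entry for a '!log' channel message, or None."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    return f"{nick}: {m.group('message')}"

entry = extract_sal_entry("volans", "!log restarted tcpircbot-logmsgbot on icinga1001")
```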
[10:57:41] Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (ema) Open→Resolved a:ema This 503 error was due to network issues in eqiad as mentioned by @Marostegui and @Antigng.
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T1100).
[11:00:04] alaa_wmde, Urbanecm, and revi: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:16] hoi~hoi
[11:01:15] :O
[11:03:59] I'll be around till next hour so if someone catches me before the slot time runs out... that's fine to me
[11:05:18] Operations, media-storage, serviceops, Patch-For-Review: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (Gilles) Error rate hasn't gone down at all, now we're just getting errors that time out at 1s instead of 0.5s... ` Jun 27 11:0...
[11:09:57] (CR) Giuseppe Lavagetto: [C: -1] "While this change seems correct, given how puppet works we need to break this down in different, progressive changes. Specifically:" [puppet] - https://gerrit.wikimedia.org/r/514226 (https://phabricator.wikimedia.org/T226675) (owner: Ppchelko)
[11:09:59] (CR) Gilles: Have the Swift rewrite proxy renew expiry headers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[11:10:43] (CR) Arturo Borrero Gonzalez: nova-compute: use puppet certs for libvirt (1 comment) [puppet] - https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: Andrew Bogott)
[11:14:50] Amir1 or Urbanecm: are you there?
[11:15:17] I'm around.
[11:15:25] I can deploy yours
[11:15:28] :)
[11:15:59] Operations, observability, Graphite: uwsgi-graphite-web.service not functional after reboots of Graphite hosts - https://phabricator.wikimedia.org/T226694 (jbond) p:Triage→Normal
[11:16:13] (CR) Ladsgroup: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/519367 (https://phabricator.wikimedia.org/T224813) (owner: Revi)
[11:17:14] (Merged) jenkins-bot: Add Portal Namespace to VisualEditor option on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/519367 (https://phabricator.wikimedia.org/T224813) (owner: Revi)
[11:17:29] (CR) jenkins-bot: Add Portal Namespace to VisualEditor option on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/519367 (https://phabricator.wikimedia.org/T224813) (owner: Revi)
[11:17:41] (CR) Ladsgroup: "We haven't finished migrating all properties on test wikidata to the new term store. This has to be done first, otherwise it'll lack value" [mediawiki-config] - https://gerrit.wikimedia.org/r/519211 (https://phabricator.wikimedia.org/T225053) (owner: Alaa Sarhan)
[11:18:19] revi: It's live on mwdebug1002
[11:18:25] {{doing}}
[11:19:16] (PS1) Gilles: Only apply expiry logic to "thumb" zone [puppet] - https://gerrit.wikimedia.org/r/519374 (https://phabricator.wikimedia.org/T211661)
[11:19:24] {{confirmed}}
[11:19:29] Urbanecm: around for your deployment?
[11:19:58] (CR) Gilles: Have the Swift rewrite proxy renew expiry headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[11:20:30] revi: going live
[11:21:18] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:519367|Add Portal Namespace to VisualEditor option on kowiki (T224813)]] (duration: 00m 57s)
[11:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:23] T224813: Enable VisualEditor for Portal Namespace on Korean Wikipedia - https://phabricator.wikimedia.org/T224813
[11:21:25] revi: ^
[11:21:29] kk
[11:22:16] Verified +2
[11:22:39] awesome Amir1 :D
[11:23:10] \o/
[11:23:18] !log EU SWAT is done
[11:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:35] Operations, SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (Jpita) @jbond https://wikitech.wikimedia.org/wiki/User:Jose_pita is that ok?
[11:28:39] Operations, SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (jbond) Open→Resolved a:jbond @Jpita thanks i have added that account to the wmf group, you should be able to login to logstash now, please re-open if you are still having prob...
[11:29:37] 10Operations, 10media-storage, 10serviceops, 10Patch-For-Review, 10User-jijiki: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10jijiki) [11:32:36] 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Jpita) @jbond it works, thanks for the help [11:34:57] (03CR) 10KartikMistry: [C: 03+1] Don't show cannot publish error to 'sysop' users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [11:35:51] (03PS1) 10Arturo Borrero Gonzalez: k8s: kubelet: replace require with a warning [puppet] - 10https://gerrit.wikimedia.org/r/519375 (https://phabricator.wikimedia.org/T215531) [11:38:11] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10jijiki) @awight I will rollout the new version to production today [11:44:31] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: node: fix template path [puppet] - 10https://gerrit.wikimedia.org/r/519376 (https://phabricator.wikimedia.org/T215531) [11:45:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: node: fix template path [puppet] - 10https://gerrit.wikimedia.org/r/519376 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [12:00:57] 10Operations, 10media-storage, 10serviceops, 10Patch-For-Review, 10User-jijiki: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10Joe) Do we have metrics on the swift backends open connections / connections queues? without such information,... 
[12:01:43] !log start of mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=testwikidatawiki --batch-size=100 --sleep=3 (T225052)
[12:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:49] T225052: Run Property Terms Rebuild script - https://phabricator.wikimedia.org/T225052
[12:01:58] (03PS1) 10Gilles: Serve JPG when WEBP conversion fails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707)
[12:04:02] 10Operations, 10media-storage, 10serviceops, 10Patch-For-Review, 10User-jijiki: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10jijiki) @Joe I will start a more thorough investigation the following days, we'll see what will come up
[12:08:39] (03CR) 10Effie Mouzeli: [C: 03+1] Serve JPG when WEBP conversion fails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707) (owner: 10Gilles)
[12:10:37] (03CR) 10Jbond: [C: 03+2] icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519240 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond)
[12:10:46] (03PS2) 10Jbond: icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519240 (https://phabricator.wikimedia.org/T226508)
[12:17:05] 10Operations, 10Traffic, 10CommRel-Specialists-Support (Apr-Jun-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Trizek-WMF) >>! In T226048#5288560,...
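The rebuildPropertyTerms.php run logged above is throttled with `--batch-size=100 --sleep=3` so the rebuild does not overwhelm replication. A minimal sketch of that batch-plus-sleep pattern, assuming a hypothetical `process_batch` callback (this is not the actual Wikibase maintenance script):

```python
import time

def rebuild_terms(entity_ids, batch_size=100, sleep=3, process_batch=None):
    """Process IDs in fixed-size batches, pausing between batches so
    replication can catch up (the --batch-size/--sleep idea)."""
    batches = 0
    for start in range(0, len(entity_ids), batch_size):
        batch = entity_ids[start:start + batch_size]
        process_batch(batch)              # hypothetical per-batch worker
        batches += 1
        if start + batch_size < len(entity_ids):
            time.sleep(sleep)             # throttle, like --sleep=3
    return batches
```

With 250 IDs and a batch size of 100, this performs three batch calls and sleeps twice.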
[12:17:24] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Joe) 05Open→03Resolved a:03Joe The immediate problem seems to be resolved given we've not see corrup...
[12:28:00] (03CR) 10Jbond: [C: 03+2] icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519234 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond)
[12:28:08] (03PS5) 10Jbond: icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519234 (https://phabricator.wikimedia.org/T226508)
[12:36:34] (03PS3) 10Jbond: missing data: audited using audit_hiera.py script [labs/private] - 10https://gerrit.wikimedia.org/r/519051
[12:37:11] (03PS2) 10Arturo Borrero Gonzalez: k8s: kubelet: stop requiring ::k8s::infrastructure_config [puppet] - 10https://gerrit.wikimedia.org/r/519375 (https://phabricator.wikimedia.org/T215531)
[12:40:23] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Trizek-WMF) @Marostegui, which wikis are affected? Only English Wikipedia? Do you need to display a banner too?
[12:41:11] (03PS11) 10Jbond: icinga: Add a script to parse and query the status.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459
[12:41:28] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Marostegui) >>! In T226358#5288891, @Trizek-WMF wrote: > @Marostegui, which wikis are affected? Only English Wikipedia? > Do you nee...
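The "icinga user agent" changes above give monitoring checks a custom User-Agent instead of a default library one, so their traffic is identifiable in server logs. A sketch of the idea using the standard library; the UA string here is illustrative, not the one in the gerrit change:

```python
import urllib.request

# Illustrative identifier; real monitoring should name the tool and a contact.
UA = "wmf-icinga-check/1.0 (ops-contact@example.org)"

def make_request(url):
    """Build a request that identifies the monitoring client rather than
    sending the default urllib User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": UA})
```

Server-side, this lets operators distinguish (and, if needed, rate-limit or allowlist) check traffic.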
[12:42:11] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Memory over 85%
[12:42:14] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Trizek-WMF) Thank you! :)
[12:42:22] (03CR) 10Jbond: icinga: Add a script to parse and query the status.dat file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond)
[12:47:24] 10Operations, 10Analytics, 10Fundraising-Backlog, 10LDAP-Access-Requests, 10Wikimedia-Fundraising: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (10spatton) Thanks @jbond and @MoritzMuehlenhoff!
[12:47:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, specially because we are about to get rid of Jessie. But I wonder if we will have to undo this factorization again when we reach Bus" [puppet] - 10https://gerrit.wikimedia.org/r/519268 (owner: 10Andrew Bogott)
[12:56:06] (03CR) 10Volans: [C: 03+1] "LGTM, last call for an external reviewer (given both John and me wrote this)" [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond)
[13:00:22] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::hadoop::ui: add client config [puppet] - 10https://gerrit.wikimedia.org/r/519373 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[13:00:29] (03PS2) 10Elukey: role::analytics_test_cluster::hadoop::ui: add client config [puppet] - 10https://gerrit.wikimedia.org/r/519373 (https://phabricator.wikimedia.org/T226698)
[13:02:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/519374 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles)
[13:04:51] (03CR) 10Filippo Giunchedi: Have the Swift rewrite proxy renew expiry headers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles)
[13:06:53] (03PS1) 10Elukey: Move analytics client profile from hue to druid [puppet] - 10https://gerrit.wikimedia.org/r/519390 (https://phabricator.wikimedia.org/T226698)
[13:07:21] (03CR) 10Filippo Giunchedi: "LGTM, nonblocking nit inline, feel free to ignore" (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707) (owner: 10Gilles)
[13:07:53] (03CR) 10Elukey: [C: 03+2] Move analytics client profile from hue to druid [puppet] - 10https://gerrit.wikimedia.org/r/519390 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[13:11:48] !log depool restbase10(0[7-9]|1[0-5]) before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/513262
[13:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:42] (03PS1) 10Elukey: role::druid::test_analytics::worker: add hive config [puppet] - 10https://gerrit.wikimedia.org/r/519393 (https://phabricator.wikimedia.org/T226698)
[13:13:07] (03PS3) 10Filippo Giunchedi: RESTBase: Remove restbase10(0[7-9]|1[0-5]) and set them as spares [puppet] - 10https://gerrit.wikimedia.org/r/513262 (https://phabricator.wikimedia.org/T223976) (owner: 10Mobrovac)
[13:14:21] (03CR) 10Filippo Giunchedi: [C: 03+2] RESTBase: Remove restbase10(0[7-9]|1[0-5]) and set them as spares [puppet] - 10https://gerrit.wikimedia.org/r/513262 (https://phabricator.wikimedia.org/T223976) (owner: 10Mobrovac)
[13:15:18] !log start druid drop datasource test - might affect AQS - T226035
[13:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:23] T226035: Dropping data from druid takes down aqs hosts - https://phabricator.wikimedia.org/T226035
[13:21:52] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: add hive config [puppet] - 10https://gerrit.wikimedia.org/r/519393 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[13:22:01] (03PS2) 10Elukey: role::druid::test_analytics::worker: add hive config [puppet] - 10https://gerrit.wikimedia.org/r/519393 (https://phabricator.wikimedia.org/T226698)
[13:24:15] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10fgiunchedi)
[13:26:33] (03PS4) 10Lucas Werkmeister (WMDE): dologmsg: add manpage [puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244)
[13:26:44] !log push RPKI classification test to cr4-ulsfo - T220669
[13:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:49] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669
[13:28:20] (03PS1) 10Elukey: role::druid::test_analytics::worker: add other hadoop client config [puppet] - 10https://gerrit.wikimedia.org/r/519395 (https://phabricator.wikimedia.org/T226698)
[13:28:48] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: add other hadoop client config [puppet] - 10https://gerrit.wikimedia.org/r/519395 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[13:29:51] (03CR) 10Lucas Werkmeister (WMDE): "I tried to rebase it, but someone should definitely test it with the puppet compiler, I’m not sure if the paths are still correct. (I trie" [puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE))
[13:30:47] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Trizek-WMF) Banner set. It will be displayed starting at 05:00 UTC July 3 on all wikis. End at 06:20 UTC.
[13:31:30] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Marostegui) Thank you!
[13:34:06] PROBLEM - Restbase root url on restbase1010 is CRITICAL: connect to address 10.64.0.112 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:34:18] PROBLEM - Restbase root url on restbase1011 is CRITICAL: connect to address 10.64.0.113 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:34:22] PROBLEM - Restbase root url on restbase1012 is CRITICAL: connect to address 10.64.32.79 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:34:34] PROBLEM - Restbase root url on restbase1014 is CRITICAL: connect to address 10.64.48.133 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:34:40] PROBLEM - Restbase root url on restbase1013 is CRITICAL: connect to address 10.64.32.80 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:34:40] PROBLEM - Restbase root url on restbase1007 is CRITICAL: connect to address 10.64.0.223 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:34:48] PROBLEM - Restbase root url on restbase1009 is CRITICAL: connect to address 10.64.48.110 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:35:00] PROBLEM - Restbase root url on restbase1008 is CRITICAL: connect to address 10.64.32.178 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:35:00] PROBLEM - Restbase root url on restbase1015 is CRITICAL: connect to address 10.64.48.134 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:35:07] mobrovac: --^
[13:35:25] anything ongoing at the moment?
[13:35:32] known elukey, these are being decommed
[13:35:42] fiuuuu
[13:35:44] thanks :)
[13:35:46] :)
[13:35:48] i'll ack
[13:36:40] (03PS2) 10Alexandros Kosiaris: Add data types to k8s module [puppet] - 10https://gerrit.wikimedia.org/r/519369
[13:36:42] (03PS2) 10Alexandros Kosiaris: Use more specific data types in k8s module [puppet] - 10https://gerrit.wikimedia.org/r/519370
[13:36:44] (03PS1) 10Alexandros Kosiaris: kubernetes: Move k8s::infrastructure_config to profile [puppet] - 10https://gerrit.wikimedia.org/r/519398
[13:38:37] uugh thanks, I guess I was too eager
[13:38:42] running puppet on icinga
[13:39:02] the decom cookbook does that, was it not run?
[13:39:18] nobody likes cookbooks!
[13:39:21] * elukey runs away
[13:40:27] volans: which one? I followed https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Remove_from_production
[13:40:58] ah is still in the phase of getting out of prod, not decom yet
[13:41:41] (03PS1) 10Elukey: role::druid::test_analytics::worker: add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/519400 (https://phabricator.wikimedia.org/T226698)
[13:41:48] it's mentioned later on
[13:42:04] and we have a bunch of improvements coming up as a follow up of a session at the sre summit
[13:42:23] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/519400 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[13:43:05] !log push RPKI classification test to cr3-ulsfo - T220669
[13:43:06] that's awesome
[13:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:10] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669
[13:43:41] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10mobrovac)
[13:44:24] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10mobrovac) 05Open→03Resolved a:03mobrovac
[13:46:28] (03PS1) 10Ema: cache: double appservers connection limit [puppet] - 10https://gerrit.wikimedia.org/r/519401
[13:51:11] (03PS2) 10Ema: cache: double appservers and api connection limit [puppet] - 10https://gerrit.wikimedia.org/r/519401
[13:51:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17146/ says noop in production and is effectively a noop, some merging" [puppet] - 10https://gerrit.wikimedia.org/r/519369 (owner: 10Alexandros Kosiaris)
[13:52:10] (03PS3) 10Alexandros Kosiaris: Add data types to k8s module [puppet] - 10https://gerrit.wikimedia.org/r/519369
[13:52:12] (03PS3) 10Alexandros Kosiaris: Use more specific data types in k8s module [puppet] - 10https://gerrit.wikimedia.org/r/519370
[13:52:14] (03PS2) 10Alexandros Kosiaris: kubernetes: Move k8s::infrastructure_config to profile [puppet] - 10https://gerrit.wikimedia.org/r/519398
[13:55:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17146/ says noop in production and is effectively a noop, merging" [puppet] - 10https://gerrit.wikimedia.org/r/519370 (owner: 10Alexandros Kosiaris)
[14:05:56] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) All our San Francisco POP now have a `validation-state` on its received prefixes. Next step is to push it to all the sites.
[14:07:02] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond)
[14:07:17] (03PS1) 10Ema: cache: reimage cp2002 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519404 (https://phabricator.wikimedia.org/T226637)
[14:08:48] (03PS1) 10Elukey: role::druid::test_analytics::worker: remove hadoop client config [puppet] - 10https://gerrit.wikimedia.org/r/519405 (https://phabricator.wikimedia.org/T226698)
[14:09:40] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: remove hadoop client config [puppet] - 10https://gerrit.wikimedia.org/r/519405 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[14:10:08] (03CR) 10Andrew Bogott: [C: 03+2] "As I was writing this I thought, "Hope we don't have to revert all this for Buster!" Time will tell :/" [puppet] - 10https://gerrit.wikimedia.org/r/519268 (owner: 10Andrew Bogott)
[14:10:16] (03PS5) 10Andrew Bogott: nova-compute: consolidate a bunch of code that isn't distro-specific [puppet] - 10https://gerrit.wikimedia.org/r/519268
[14:10:55] (03CR) 10Jbond: [C: 03+2] icinga: Add a script to parse and query the status.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond)
[14:11:03] (03PS12) 10Jbond: icinga: Add a script to parse and query the status.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459
[14:11:32] !log depool cp2002 and reimage as upload_ats T226637
[14:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:37] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[14:12:03] (03PS6) 10Andrew Bogott: nova-compute: consolidate a bunch of code that isn't distro-specific [puppet] - 10https://gerrit.wikimedia.org/r/519268
[14:13:01] (03PS2) 10Ema: cache: reimage cp2002 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519404 (https://phabricator.wikimedia.org/T226637)
[14:13:52] !log push RPKI classification test to eqord - T220669
[14:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:57] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669
[14:13:58] (03CR) 10Ema: [C: 03+2] cache: reimage cp2002 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519404 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema)
[14:16:18] (03PS7) 10Andrew Bogott: nova-compute: consolidate a bunch of code that isn't distro-specific [puppet] - 10https://gerrit.wikimedia.org/r/519268
[14:16:19] (03PS3) 10Andrew Bogott: nova-compute: remove libvirt_type params [puppet] - 10https://gerrit.wikimedia.org/r/519275
[14:16:22] (03PS13) 10Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484)
[14:16:23] (03PS2) 10Andrew Bogott: libvirtd: turn on listening [puppet] - 10https://gerrit.wikimedia.org/r/519315
[14:17:47] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2002.codfw.wmnet'] ` The log can be found in `...
[14:19:27] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute: remove libvirt_type params [puppet] - 10https://gerrit.wikimedia.org/r/519275 (owner: 10Andrew Bogott)
[14:23:58] !log running `mwscript extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --wiki=commonswiki --audio --missing --throttle` in screen as me on mwmaint1002 T226713
[14:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:03] T226713: Run cleanupTranscodes.php for current midi files - https://phabricator.wikimedia.org/T226713
[14:24:47] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/compiler1002/17148/ says noop" [puppet] - 10https://gerrit.wikimedia.org/r/519398 (owner: 10Alexandros Kosiaris)
[14:28:17] !log push RPKI classification to Dallas - T220669
[14:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:22] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669
[14:28:44] (03PS1) 10Elukey: role::druid::test_analytics::worker: enable kerberos [puppet] - 10https://gerrit.wikimedia.org/r/519408 (https://phabricator.wikimedia.org/T226698)
[14:28:49] 10Operations, 10media-storage, 10serviceops, 10Patch-For-Review, 10User-jijiki: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10fgiunchedi) >>! In T226373#5288615, @Gilles wrote: > Error rate hasn't gone down at all, now we're just gettin...
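The `restbase10(0[7-9]|1[0-5])` expression used in the depool and decommission messages above is a regular expression covering nine hosts. A quick expansion showing exactly which hostnames it matches:

```python
import re

# The host pattern from the depool/decommission log messages above.
pattern = re.compile(r"restbase10(0[7-9]|1[0-5])$")

# Check it against every possible restbase10NN name.
candidates = [f"restbase10{i:02d}" for i in range(100)]
matched = [host for host in candidates if pattern.match(host)]
# matched covers restbase1007 through restbase1015 inclusive.
```

This is the same shorthand Cumin-style tooling expands when targeting host ranges.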
[14:28:59] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: 10Andrew Bogott)
[14:29:24] (03PS2) 10Elukey: role::druid::test_analytics::worker: enable kerberos [puppet] - 10https://gerrit.wikimedia.org/r/519408 (https://phabricator.wikimedia.org/T226698)
[14:30:10] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: enable kerberos [puppet] - 10https://gerrit.wikimedia.org/r/519408 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[14:30:53] (03CR) 10Herron: "akosiaris got time for one more calico update?" [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron)
[14:32:34] (03PS1) 10Cwhite: grafana: remove legacy varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/519410 (https://phabricator.wikimedia.org/T184942)
[14:32:58] (03CR) 10Alexandros Kosiaris: "> akosiaris got time for one more calico update?" [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron)
[14:33:17] !log push newer calico outgoing policy rules. T225005
[14:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:21] T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005
[14:33:31] thank you!
[14:33:40] (03PS3) 10Andrew Bogott: libvirtd: turn on listening [puppet] - 10https://gerrit.wikimedia.org/r/519315
[14:33:42] (03PS1) 10Andrew Bogott: nova-compute: fix dependency for cacert.pem [puppet] - 10https://gerrit.wikimedia.org/r/519411 (https://phabricator.wikimedia.org/T225484)
[14:33:46] yw
[14:33:55] PROBLEM - puppet last run on cloudvirt1025 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
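The status.dat file targeted by the "icinga: Add a script to parse and query the status.dat file" change (gerrit 514459) is a series of `type { key=value ... }` blocks. A minimal sketch of parsing that shape, under the assumption of well-formed blocks; this is not the reviewed script itself:

```python
import re

def parse_status_dat(text):
    """Parse Nagios/Icinga status.dat-style blocks ('type { key=value ... }')
    into a list of (block_type, fields) pairs. Simplified sketch only."""
    entries = []
    for block in re.finditer(r"(\w+)\s*\{(.*?)\}", text, re.S):
        kind, body = block.group(1), block.group(2)
        fields = {}
        for line in body.splitlines():
            line = line.strip()
            if "=" in line:
                key, _, value = line.partition("=")
                fields[key] = value
        entries.append((kind, fields))
    return entries
```

Querying then reduces to filtering the parsed entries, e.g. all `servicestatus` blocks with `current_state` of 2 (CRITICAL).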
[14:34:24] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute: fix dependency for cacert.pem [puppet] - 10https://gerrit.wikimedia.org/r/519411 (https://phabricator.wikimedia.org/T225484) (owner: 10Andrew Bogott)
[14:35:35] PROBLEM - puppet last run on cloudvirt1019 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[14:36:32] (03PS4) 10Andrew Bogott: libvirtd: turn on listening [puppet] - 10https://gerrit.wikimedia.org/r/519315
[14:36:34] (03PS1) 10Andrew Bogott: nova-compute: fix dependency for cacert.pem [puppet] - 10https://gerrit.wikimedia.org/r/519412 (https://phabricator.wikimedia.org/T225484)
[14:37:24] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute: fix dependency for cacert.pem [puppet] - 10https://gerrit.wikimedia.org/r/519412 (https://phabricator.wikimedia.org/T225484) (owner: 10Andrew Bogott)
[14:37:53] (03PS1) 10Ema: cache_upload codfw: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/519414 (https://phabricator.wikimedia.org/T226637)
[14:38:25] PROBLEM - puppet last run on cloudvirt1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[14:39:19] RECOVERY - puppet last run on cloudvirt1025 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[14:40:39] PROBLEM - puppet last run on cloudvirt1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[14:40:57] PROBLEM - puppet last run on cloudvirtan1005 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[14:42:11] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[14:43:00] (03PS2) 10Herron: kafka-main2001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/519021
[14:43:20] !log beginning replacement of kafka2001 with kafka-main2001 T225005
[14:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:26] T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005
[14:46:00] (03CR) 10Herron: [C: 03+2] kafka-main2001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/519021 (owner: 10Herron)
[14:46:03] PROBLEM - puppet last run on cloudvirt1008 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[14:48:18] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2002.codfw.wmnet'] ` The log can be found in `...
[14:53:33] (03PS1) 10Alexandros Kosiaris: k8s: Make $extra_params type an array of strings [puppet] - 10https://gerrit.wikimedia.org/r/519418
[14:55:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s: Make $extra_params type an array of strings [puppet] - 10https://gerrit.wikimedia.org/r/519418 (owner: 10Alexandros Kosiaris)
[15:03:48] RECOVERY - puppet last run on cloudvirt1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[15:05:13] (03PS1) 10Ema: Revert "install_server: Do not re-image db1133 and dbproxy" [puppet] - 10https://gerrit.wikimedia.org/r/519419
[15:05:22] (03PS1) 10Filippo Giunchedi: install_server: use buster for centrallog1001 [puppet] - 10https://gerrit.wikimedia.org/r/519420 (https://phabricator.wikimedia.org/T200706)
[15:05:26] (03PS1) 10Alexandros Kosiaris: Revert "install_server: Do not re-image db1133 and dbproxy" [puppet] - 10https://gerrit.wikimedia.org/r/519421
[15:05:57] (03PS2) 10Alexandros Kosiaris: Revert "install_server: Do not re-image db1133 and dbproxy" [puppet] - 10https://gerrit.wikimedia.org/r/519421
[15:06:20] RECOVERY - puppet last run on cloudvirt1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:06:28] 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Development services), and 3 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10MoritzMuehlenhoff) @mmodell I've built an interim package on our package...
[15:06:40] (03PS1) 10Elukey: druid: ensure that the druid user is in the pupept catalog [puppet] - 10https://gerrit.wikimedia.org/r/519422 (https://phabricator.wikimedia.org/T226698)
[15:06:40] RECOVERY - puppet last run on cloudvirtan1005 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[15:06:40] RECOVERY - puppet last run on cloudvirt1019 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:06:50] (03CR) 10Urbanecm: [C: 03+2] "Per Trizek's request, beta only, so it can go out at any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) (owner: 10Kosta Harlan)
[15:07:05] (03PS3) 10Urbanecm: Betalabs: Enable GrowthExperiments features for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) (owner: 10Kosta Harlan)
[15:07:08] (03PS2) 10Herron: kafka-main: replace kafka2001 hardware with kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005)
[15:07:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "install_server: Do not re-image db1133 and dbproxy" [puppet] - 10https://gerrit.wikimedia.org/r/519421 (owner: 10Alexandros Kosiaris)
[15:07:18] (03CR) 10Urbanecm: [C: 03+2] "Per Trizek's request, beta only, so it can go out at any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) (owner: 10Kosta Harlan)
[15:08:15] (03Merged) 10jenkins-bot: Betalabs: Enable GrowthExperiments features for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) (owner: 10Kosta Harlan)
[15:08:17] (03CR) 10Ottomata: [C: 03+1] druid: ensure that the druid user is in the pupept catalog [puppet] - 10https://gerrit.wikimedia.org/r/519422 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[15:08:30] (03CR) 10jenkins-bot: Betalabs: Enable GrowthExperiments features for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) (owner: 10Kosta Harlan)
[15:08:42] (03CR) 10Herron: [C: 03+2] kafka-main: replace kafka2001 hardware with kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron)
[15:08:49] (03PS3) 10Herron: kafka-main: replace kafka2001 hardware with kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005)
[15:09:05] (03PS2) 10Elukey: druid: ensure that the druid user is in the pupept catalog [puppet] - 10https://gerrit.wikimedia.org/r/519422 (https://phabricator.wikimedia.org/T226698)
[15:09:11] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T223825 (10jijiki)
[15:09:15] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10jijiki)
[15:09:36] (03CR) 10Elukey: [C: 03+2] druid: ensure that the druid user is in the pupept catalog [puppet] - 10https://gerrit.wikimedia.org/r/519422 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[15:10:33] 10Operations, 10Traffic, 10CommRel-Specialists-Support (Apr-Jun-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Elitre) To clarify, Trizek isn't ask...
[15:11:23] (03PS4) 10Herron: kafka-main: replace kafka2001 hardware with kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005)
[15:11:55] (03PS5) 10Andrew Bogott: libvirtd: turn on listening [puppet] - 10https://gerrit.wikimedia.org/r/519315
[15:11:58] RECOVERY - puppet last run on cloudvirt1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:12:27] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2002.codfw.wmnet'] ` The log can be found in `...
[15:12:54] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: use buster for centrallog1001 [puppet] - 10https://gerrit.wikimedia.org/r/519420 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi)
[15:13:02] (03PS2) 10Filippo Giunchedi: install_server: use buster for centrallog1001 [puppet] - 10https://gerrit.wikimedia.org/r/519420 (https://phabricator.wikimedia.org/T200706)
[15:13:12] (03PS6) 10Andrew Bogott: libvirtd: turn on listening [puppet] - 10https://gerrit.wikimedia.org/r/519315
[15:14:01] (03CR) 10Andrew Bogott: [C: 03+2] libvirtd: turn on listening [puppet] - 10https://gerrit.wikimedia.org/r/519315 (owner: 10Andrew Bogott)
[15:16:14] (03PS3) 10Filippo Giunchedi: install_server: use buster for centrallog1001 [puppet] - 10https://gerrit.wikimedia.org/r/519420 (https://phabricator.wikimedia.org/T200706)
[15:24:20] (03PS1) 10ArielGlenn: tiny bit more error reporting when we do stubs/pagelogs/abstracts [dumps] - 10https://gerrit.wikimedia.org/r/519427 (https://phabricator.wikimedia.org/T226659)
[15:26:06] 10Operations, 10Space, 10Wikimedia-Mailing-lists: Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Qgil)
[15:26:45] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Qgil)
[15:34:09] (03PS1) 10Muehlenhoff: Record extended MOU for nathante [puppet] - 10https://gerrit.wikimedia.org/r/519429
[15:37:10] (03PS1) 10Andrew Bogott: libvirt: remove some source files that aren't actually installed [puppet] - 10https://gerrit.wikimedia.org/r/519431
[15:37:37] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (10Qgil) Sorry, it took more time than I expected when I posted the comment above, but here it is: {T226727} & https://discuss-space.wmflabs.org/t/inte...
[15:39:39] (03PS2) 10Andrew Bogott: libvirt: remove some source files that aren't actually installed [puppet] - 10https://gerrit.wikimedia.org/r/519431
[15:40:39] (03CR) 10Andrew Bogott: [C: 03+2] libvirt: remove some source files that aren't actually installed [puppet] - 10https://gerrit.wikimedia.org/r/519431 (owner: 10Andrew Bogott)
[15:40:50] (03PS1) 10BPirkle: Add kask session storage configuration. Use only on testwiki, [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519432 (https://phabricator.wikimedia.org/T222099)
[15:41:36] (03PS2) 10Muehlenhoff: Record extended MOU for nathante [puppet] - 10https://gerrit.wikimedia.org/r/519429
[15:41:43] (03CR) 10jerkins-bot: [V: 04-1] Add kask session storage configuration. Use only on testwiki, [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519432 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle)
[15:43:08] (03CR) 10Muehlenhoff: [C: 03+2] Record extended MOU for nathante [puppet] - 10https://gerrit.wikimedia.org/r/519429 (owner: 10Muehlenhoff)
[15:43:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[15:43:26] (03PS2) 10BPirkle: Add kask session storage configuration. Use only on testwiki, [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519432 (https://phabricator.wikimedia.org/T222099)
[15:44:13] (03CR) 10jerkins-bot: [V: 04-1] Add kask session storage configuration. Use only on testwiki, [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519432 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle)
[15:44:30] (03PS1) 10Jbond: mailman: rename mailing list [puppet] - 10https://gerrit.wikimedia.org/r/519434 (https://phabricator.wikimedia.org/T218367)
[15:45:17] (03CR) 10ArielGlenn: [C: 03+2] tiny bit more error reporting when we do stubs/pagelogs/abstracts [dumps] - 10https://gerrit.wikimedia.org/r/519427 (https://phabricator.wikimedia.org/T226659) (owner: 10ArielGlenn)
[15:46:34] (03CR) 10Herron: [C: 03+1] "lgtm, one nitpick" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519434 (https://phabricator.wikimedia.org/T218367) (owner: 10Jbond)
[15:48:11] (03PS2) 10Jbond: mailman: rename mailing list [puppet] - 10https://gerrit.wikimedia.org/r/519434 (https://phabricator.wikimedia.org/T218367)
[15:48:26] (03CR) 10Jbond: mailman: rename mailing list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519434 (https://phabricator.wikimedia.org/T218367) (owner: 10Jbond)
[15:48:34] (03PS3) 10Jbond: mailman: rename mailing list [puppet] - 10https://gerrit.wikimedia.org/r/519434 (https://phabricator.wikimedia.org/T218367)
[15:49:33] (03CR) 10Jbond: [C: 03+2] mailman: rename mailing list [puppet] - 10https://gerrit.wikimedia.org/r/519434 (https://phabricator.wikimedia.org/T218367) (owner: 10Jbond)
[15:50:16] (03PS1) 10CDanis: phaste: set a User-Agent in line with WMF policy [puppet] - 10https://gerrit.wikimedia.org/r/519435
[15:50:18] (03PS1) 10CDanis: phaste: add argument parsing and --title [puppet] - 10https://gerrit.wikimedia.org/r/519436
[15:50:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[15:53:50] (03PS2) 10Jbond: phaste: set a User-Agent in line with WMF policy [puppet] - 10https://gerrit.wikimedia.org/r/519435 (https://phabricator.wikimedia.org/T226508) (owner: 10CDanis)
[15:54:04] (03CR) 10Jbond: "LGTM - added Bug: T226508" [puppet] - 10https://gerrit.wikimedia.org/r/519435 (https://phabricator.wikimedia.org/T226508) (owner: 10CDanis)
[15:54:36] (03PS3) 10CDanis: phaste: set a User-Agent in line with WMF policy [puppet] - 10https://gerrit.wikimedia.org/r/519435 (https://phabricator.wikimedia.org/T226508)
[15:54:48] (03CR) 10CDanis: [C: 03+2] phaste: set a User-Agent in line with WMF policy [puppet] - 10https://gerrit.wikimedia.org/r/519435 (https://phabricator.wikimedia.org/T226508) (owner: 10CDanis)
[15:56:21] (03CR) 10Volans: phaste: set a User-Agent in line with WMF policy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519435 (https://phabricator.wikimedia.org/T226508) (owner: 10CDanis)
[15:57:42] (03PS2) 10CDanis: phaste: add argument parsing and --title [puppet] - 10https://gerrit.wikimedia.org/r/519436
[16:00:02] (03PS1) 10Elukey: role::druid::test_analytics::worker: fix some kerberos parameters [puppet] - 10https://gerrit.wikimedia.org/r/519439
(https://phabricator.wikimedia.org/T226698) [16:00:05] godog and _joe_: #bothumor I ❤ Unicode. All rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:01:07] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/519436 (owner: 10CDanis) [16:01:34] (03CR) 10CDanis: [C: 03+2] phaste: add argument parsing and --title [puppet] - 10https://gerrit.wikimedia.org/r/519436 (owner: 10CDanis) [16:01:40] thanks jbond42 ! [16:01:53] no problem :) [16:02:22] (03PS2) 10Elukey: role::druid::test_analytics::worker: fix some kerberos parameters [puppet] - 10https://gerrit.wikimedia.org/r/519439 (https://phabricator.wikimedia.org/T226698) [16:04:15] thanks for allowing me to avoid publicly +1 that script :-P [16:04:24] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: fix some kerberos parameters [puppet] - 10https://gerrit.wikimedia.org/r/519439 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [16:04:44] lol [16:05:23] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10jbond) @Varnent I have now renamed the old list and it is now available at https://lists.wikimedia.org/mailman/listinfo/MoveCom. The old U...
[16:05:50] <[1997kB]> SSE stream, Error: broker transport failure Code: -195 errno: -195 [16:06:08] * cdanis doing the dirty work so volans doesn't have to <3 [16:06:36] <3 [16:08:44] (03PS1) 10Alexandros Kosiaris: ganeti partman: Switch to 100% of lvm guided size [puppet] - 10https://gerrit.wikimedia.org/r/519441 (https://phabricator.wikimedia.org/T224603) [16:10:12] (03Abandoned) 10Ema: Revert "install_server: Do not re-image db1133 and dbproxy" [puppet] - 10https://gerrit.wikimedia.org/r/519419 (owner: 10Ema) [16:10:31] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:12:26] (03PS1) 10Jcrespo: Revert "dbproxy: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/519442 [16:13:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti partman: Switch to 100% of lvm guided size [puppet] - 10https://gerrit.wikimedia.org/r/519441 (https://phabricator.wikimedia.org/T224603) (owner: 10Alexandros Kosiaris) [16:16:42] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2002.codfw.wmnet'] ` and were **ALL** successful. 
[16:18:55] (03PS2) 10Ema: cache_upload codfw: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/519414 (https://phabricator.wikimedia.org/T226637) [16:22:22] (03CR) 10Ema: [C: 03+2] cache_upload codfw: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/519414 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [16:22:45] 10Operations, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Gerrit admin permissions for Ottomata - https://phabricator.wikimedia.org/T226724 (10Jdforrester-WMF) [16:24:26] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T226569 (10Cmjohnson) 05Open→03Resolved @Marostegui disk swapped but this server is out of warranty. I would suggest moving masters to new servers. [16:24:43] (03PS2) 10Jcrespo: Revert "dbproxy: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/519442 [16:29:33] 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Cmjohnson) @Eevans Do you still want to move this server? Let's coordinate a day/time [16:31:50] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [16:36:56] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T226569 (10jcrespo) That's the plan. See: ` root@db1072:~$ megacli -PDList -aALL | grep rro Media Error Count: 0 Other Error Count: 0 Media Error Count: 0 Other Error Count: 3 Media Error Count: 0 Other Error Coun... 
[16:37:04] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti2009.codfw.wmnet', 'ganeti2010.codfw.wmnet', 'ganeti... [16:38:04] (03CR) 10Jcrespo: [C: 03+2] Revert "dbproxy: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/519442 (owner: 10Jcrespo) [16:39:11] !log pool cp2002 w/ ATS backend T226637 [16:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:17] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [16:41:08] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: add stretch-wikimedia/component/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519449 (https://phabricator.wikimedia.org/T215975) [16:42:44] !log repool labsdb1011 T222978 [16:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:48] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 [16:43:48] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: add stretch-wikimedia/component/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519449 (https://phabricator.wikimedia.org/T215975) [16:45:39] (03PS3) 10Arturo Borrero Gonzalez: aptrepo: add stretch-wikimedia/thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519449 (https://phabricator.wikimedia.org/T215975) [16:46:29] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10ACraze) Ahh ok, I'm able to get in to turnilo now, thanks! 
[16:53:50] (03CR) 10Cwhite: [C: 03+2] initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [16:55:10] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10akosiaris) @WMDE-leszek, @Tarrow. Any feedback on the comment above? [16:58:37] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Tarrow) @akosiaris Yep; we've interpreted it as something we really need before exposing it to real traffic. We've got a ticket open about... [17:00:04] cscott, arlolra, subbu, and halfak: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T1700). [17:04:20] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:06:49] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10akosiaris) >>! In T212189#5289724, @Tarrow wrote: > @akosiaris Yep; we've interpreted it as something we really need before exposing it to... [17:08:12] (03PS3) 10BPirkle: Add kask session storage configuration. 
Use only on testwiki, [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519432 (https://phabricator.wikimedia.org/T222099) [17:09:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/519449 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [17:13:19] (03PS1) 10Volans: dbconfig: structure return values of actions [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 [17:13:21] (03PS1) 10Volans: dbconfig: config diff, on non-empty diff exit 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/519458 [17:13:23] (03PS1) 10Volans: configuration: change IRC default values [software/conftool] - 10https://gerrit.wikimedia.org/r/519459 [17:13:25] (03PS1) 10Volans: dbconfig: allow to remote paste messages [software/conftool] - 10https://gerrit.wikimedia.org/r/519460 [17:13:27] (03PS1) 10Volans: dbconfig: improve config commit and restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519461 [17:17:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: add stretch-wikimedia/thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519449 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [17:21:04] !log imported gpg keys 9DC858229FC7DD38854AE2D88D81803C0EBFCD88 and 54A647F9048D5688D7DA2ABE6A030B21BA07F4FB into install1002 for T215975 [17:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:10] T215975: Package/copy kubeadm, kubelet, docker-ce and kubectl to Toolforge Aptly or Reprepro - https://phabricator.wikimedia.org/T215975 [17:23:35] !log ppchelko@deploy1001 Started deploy [restbase/deploy@da50001]: Use new projects and new config layout T220855, rb2009 only [17:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:40] T220855: Split the RESTBase execution paths - https://phabricator.wikimedia.org/T220855 [17:25:52] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: fix wrong 
component name for thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519463 (https://phabricator.wikimedia.org/T215975) [17:26:13] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@da50001]: Use new projects and new config layout T220855, rb2009 only (duration: 02m 38s) [17:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:29] 10Operations, 10ops-eqiad, 10Analytics, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) [17:27:30] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:27:55] known ^ [17:28:16] RECOVERY - MegaRAID on db1072 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:28:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: fix wrong component name for thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519463 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [17:31:38] 10Operations, 10SRE-Access-Requests: please re-activate LDAP access for Dzahn - https://phabricator.wikimedia.org/T226744 (10Mutante) [17:34:42] restbase alert is known, the node's depooled [17:35:56] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10jbond) 05Open→03Resolved a:03jbond great, i think this is done now so closing please re open if there is still an issue [17:37:28] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10ops-monitoring-bot) Completed auto-reimage of 
hosts: ` ['ganeti2013.codfw.wmnet', 'ganeti2014.codfw.wmnet', 'ganeti2012.codfw.wmnet', 'ganeti2010.codfw.wmnet', 'ganeti2009.codfw.wmnet', 'gan... [17:42:18] (03CR) 10Urbanecm: [C: 04-1] "This depends on I8f18baa4b34318ac14e8c8e362ea59a1283c52c4, which is merged, but not in wmf.11. Deploying this before July 11 will make thi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [17:45:10] 10Operations, 10SRE-Access-Requests: please re-activate LDAP access for Dzahn - https://phabricator.wikimedia.org/T226744 (10jbond) this should be fixed now reopen if more is needed or ping me on irc [17:45:21] 10Operations, 10SRE-Access-Requests: please re-activate LDAP access for Dzahn - https://phabricator.wikimedia.org/T226744 (10jbond) 05Open→03Resolved a:03jbond [17:47:36] (03CR) 10Petar.petkovic: "This can wait until July 11. We don't need to cherry pick I8f18baa4b34318ac14e8c8e362ea59a1283c52c4. Until then, sysops will see the warni" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [17:49:08] * Krinkle staging on mwdebug1002 [17:51:49] (03PS2) 10Urbanecm: Restrict uploading on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [17:53:55] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [17:55:11] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/WikimediaIncubator/includes/WikimediaIncubator.php: T204883 / 93643b44a52ea7 (duration: 01m 00s) [17:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:18] T204883: Incubator emits "PHP Notice: Undefined index: realtitle" - https://phabricator.wikimedia.org/T204883 [17:55:34] (03PS1) 10Arturo Borrero Gonzalez: 
aptrepo: reprepro config file format fixes for thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519470 (https://phabricator.wikimedia.org/T215531) [17:56:08] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: reprepro config file format fixes for thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519470 (https://phabricator.wikimedia.org/T215975) [17:56:48] * Krinkle is done [17:57:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: reprepro config file format fixes for thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519470 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [17:58:17] 10Operations, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Gerrit manager rights for Ottomata - https://phabricator.wikimedia.org/T226724 (10Ottomata) [17:59:11] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855, rb2009 only, fixed mathoid config [17:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:16] T220855: Split the RESTBase execution paths - https://phabricator.wikimedia.org/T220855 [18:00:04] MaxSem, RoanKattouw, and Niharika: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T1800). [18:00:04] Urbanecm: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:17] Let's do the needful then [18:00:42] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519257 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:00:43] jbond42: can you help with https://phabricator.wikimedia.org/T226724 ? 
[18:01:06] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:01:30] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855, rb2009 only, fixed mathoid config (duration: 02m 19s) [18:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:49] (03CR) 10Urbanecm: [C: 04-2] "DNM, notes for personal usage" (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519180 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:03:12] (03PS3) 10Urbanecm: Revert "Revert "Set default aliases for Project_talk namespace"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519257 (https://phabricator.wikimedia.org/T173070) [18:03:24] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855, restbase1016 [18:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:32] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855, restbase1016 (duration: 00m 08s) [18:03:36] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "gate pipeline succeeded, but failed to rebase => manual V+2 and submitting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519257 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:56] (03CR) 10jenkins-bot: Revert "Revert "Set default aliases for Project_talk namespace"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519257 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:04:54] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:00] !log robh@cumin1001 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) [18:05:01] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855, restbase1016 [18:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:04] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2039 - https://phabricator.wikimedia.org/T225988 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `db2039.codfw.wmnet` - db2039.codfw.wmnet - Removed from Puppet master and PuppetDB - Downt... [18:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:09] T220855: Split the RESTBase execution paths - https://phabricator.wikimedia.org/T220855 [18:05:21] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:519257|Revert "Revert "Set default aliases for Project_talk namespace""]] (T173070) (duration: 00m 57s) [18:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:27] T173070: Set default aliases for Project_talk namespace - https://phabricator.wikimedia.org/T173070 [18:06:42] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855, restbase1016 (duration: 01m 41s) [18:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:58] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2039 - https://phabricator.wikimedia.org/T225988 (10RobH) [18:07:28] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: fix missing -e flag in grep-dctrl for thirparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519472 (https://phabricator.wikimedia.org/T215975) [18:08:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: fix missing -e flag in grep-dctrl for thirparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519472 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero 
Gonzalez) [18:08:17] !log running namespaceDupes.php across all wikis in tmux on mwmaint1002 (T173070) [18:08:21] (03PS1) 10RobH: decom db2039 [puppet] - 10https://gerrit.wikimedia.org/r/519473 (https://phabricator.wikimedia.org/T225988) [18:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:51] (03PS1) 10RobH: decom db2039 prod dns [dns] - 10https://gerrit.wikimedia.org/r/519474 (https://phabricator.wikimedia.org/T225988) [18:08:53] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [18:08:59] (03PS3) 10Urbanecm: Restrict uploading on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [18:09:02] (03CR) 10RobH: [C: 03+2] decom db2039 [puppet] - 10https://gerrit.wikimedia.org/r/519473 (https://phabricator.wikimedia.org/T225988) (owner: 10RobH) [18:09:06] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [18:09:11] (03PS2) 10RobH: decom db2039 [puppet] - 10https://gerrit.wikimedia.org/r/519473 (https://phabricator.wikimedia.org/T225988) [18:09:23] (03CR) 10RobH: [C: 03+2] decom db2039 prod dns [dns] - 10https://gerrit.wikimedia.org/r/519474 (https://phabricator.wikimedia.org/T225988) (owner: 10RobH) [18:10:01] (03Merged) 10jenkins-bot: Restrict uploading on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [18:10:16] (03CR) 10jenkins-bot: Restrict uploading on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [18:11:13] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2039 - 
https://phabricator.wikimedia.org/T225988 (10RobH) a:05RobH→03Papaul [18:11:56] (03PS1) 10Urbanecm: Remove several wikis from commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519475 (https://phabricator.wikimedia.org/T185898) [18:12:16] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:13:43] !log kafka2001 -> kafka-main2001 migration complete. re-enabling alerting on kafka-main2001, and moving kafka2001 to role::spare::system T225005 [18:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:49] T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 [18:13:58] (03PS1) 10Herron: Revert "kafka-main2001: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/519476 [18:14:07] (03PS2) 10Herron: Revert "kafka-main2001: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/519476 [18:14:59] (03PS1) 10Urbanecm: Add + in front of wikimaniawiki in GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519477 (https://phabricator.wikimedia.org/T225505) [18:15:14] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519477 (https://phabricator.wikimedia.org/T225505) (owner: 10Urbanecm) [18:16:11] (03Merged) 10jenkins-bot: Add + in front of wikimaniawiki in GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519477 (https://phabricator.wikimedia.org/T225505) (owner: 10Urbanecm) [18:16:25] (03CR) 10jenkins-bot: Add + in front of wikimaniawiki in GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519477 (https://phabricator.wikimedia.org/T225505) (owner: 10Urbanecm) [18:18:52] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:516623|Restrict uploading on wikimaniawiki]], 
[[:gerrit:519477|Add + in front of wikimaniawiki in GroupOverrides]] (T225505) (duration: 00m 57s) [18:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:58] T225505: Change upload related permissions on wikimania-wiki - https://phabricator.wikimedia.org/T225505 [18:19:25] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 432 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:19:54] (03PS2) 10Urbanecm: Remove several wikis from commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519475 (https://phabricator.wikimedia.org/T185898) [18:20:01] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519475 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:20:26] !log urbanecm@deploy1001 Synchronized dblists/commonsuploads.dblist: [[:gerrit:516623|Restrict uploading on wikimaniawiki]] (T225505) (duration: 00m 56s) [18:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:54] (03Merged) 10jenkins-bot: Remove several wikis from commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519475 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:21:06] (03CR) 10jenkins-bot: Remove several wikis from commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519475 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:21:25] (03PS1) 10Andrew Bogott: nova-fullstack: add a watch on the number of leaked VMs. 
[puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) [18:21:40] (03CR) 10Andrew Bogott: [C: 04-1] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [18:21:50] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: add a watch on the number of leaked VMs. [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [18:22:38] !log urbanecm@deploy1001 Synchronized dblists/commonsuploads.dblist: [[:gerrit:519475|Remove several wikis from commonsuploads.dblist]] (T185898) (duration: 00m 57s) [18:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:43] T185898: Soft disable uploads doesn't work at some wikis - https://phabricator.wikimedia.org/T185898 [18:22:53] (03PS1) 10SBassett: Add rate limiter to Special:ConfirmEmail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [18:24:31] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 17 probes of 432 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:25:03] (03PS2) 10Urbanecm: Tidy up groupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519180 (https://phabricator.wikimedia.org/T185898) [18:25:31] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519180 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:26:26] (03Merged) 10jenkins-bot: Tidy up groupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519180 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:26:40] (03CR) 10jenkins-bot: Tidy up groupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519180 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:26:51] (03PS1) 
10Arturo Borrero Gonzalez: toolforge: k8s: add basic kubeadm infra [puppet] - 10https://gerrit.wikimedia.org/r/519480 (https://phabricator.wikimedia.org/T215975) [18:29:26] (03PS2) 10SBassett: Add rate limiter to Special:ConfirmEmail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [18:29:33] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:519180|Tidy up groupOverrides]] (T185898) (duration: 00m 56s) [18:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:39] T185898: Soft disable uploads doesn't work at some wikis - https://phabricator.wikimedia.org/T185898 [18:31:34] (03Abandoned) 10SBassett: Add rate limiter to Special:ConfirmEmail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: 10SBassett) [18:33:30] !log Morning SWAT done, namespaceDupes.php still running for T173070 [18:33:33] !log gerrit set-account --active '"Dzahn"' [18:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:35] T173070: Set default aliases for Project_talk namespace - https://phabricator.wikimedia.org/T173070 [18:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:10] (03CR) 10Jhedden: nova-fullstack: add a watch on the number of leaked VMs. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [18:38:14] (03PS1) 10Herron: kafka2001 move to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/519483 (https://phabricator.wikimedia.org/T225005) [18:38:23] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:40:20] (03CR) 10Dzahn: "thank you for fixing this! it had been overlooked indeed. nice to see that it was fixed while i was gone." 
[puppet] - 10https://gerrit.wikimedia.org/r/513713 (https://phabricator.wikimedia.org/T224752) (owner: 1020after4) [18:40:42] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/17156/" [puppet] - 10https://gerrit.wikimedia.org/r/519483 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [18:40:46] (03PS1) 10Urbanecm: Tidy up GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) [18:41:08] (03CR) 10Herron: [C: 03+2] kafka2001 move to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/519483 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [18:42:19] (03PS2) 10Urbanecm: Tidy up GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T185898) [18:43:24] (03PS1) 10Jeena Huneidi: Give scaffold template configuration options for dev purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) [18:46:15] (03PS3) 10Urbanecm: Tidy up GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) [18:46:19] !log Reopen Morning SWAT [18:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:40] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:46:59] (03CR) 10Urbanecm: [C: 03+2] Tidy up GroupOverrides (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:47:08] (03CR) 10Urbanecm: [C: 03+2] "> Patch Set 3:" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:47:42] (03Merged) 10jenkins-bot: Tidy up GroupOverrides [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:47:58] (03CR) 10jenkins-bot: Tidy up GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:48:05] !log foreachwiki namespaceDupes.php --fix done (T173070) [18:48:08] 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Reedy) [18:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:10] T173070: Set default aliases for Project_talk namespace - https://phabricator.wikimedia.org/T173070 [18:49:29] 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Reedy) I note I can also replicate this in incognito mode [18:49:58] !log urbanecm@deploy1001 Synchronized dblists/commonsuploads.dblist: [[:gerrit:Tidy up GroupOverrides]], part 1 (T173070) (duration: 00m 57s) [18:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:519484|Tidy up GroupOverrides]] (T173070) (duration: 00m 56s) [18:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:49] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:52:13] !log Morning SWAT done for real [18:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:08] (03CR) 10Bstorm: [C: 03+1] "That should get us going! 
We'll probably want to scrap our current servers and build again :-p" [puppet] - 10https://gerrit.wikimedia.org/r/519480 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [18:54:09] 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Urbanecm) Reminds me of T151770. [18:54:59] (03PS1) 10Ottomata: Move page_links_change back to new schema aware refine job [puppet] - 10https://gerrit.wikimedia.org/r/519486 (https://phabricator.wikimedia.org/T226268) [18:57:10] (03PS2) 10Ottomata: Move page_links_change back to new schema aware refine job [puppet] - 10https://gerrit.wikimedia.org/r/519486 (https://phabricator.wikimedia.org/T226268) [18:57:26] (03PS3) 10Ottomata: Move page_links_change back to new schema aware refine job [puppet] - 10https://gerrit.wikimedia.org/r/519486 (https://phabricator.wikimedia.org/T226268) [18:58:01] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10RobH) [18:58:54] (03CR) 10Ottomata: [C: 03+2] Move page_links_change back to new schema aware refine job [puppet] - 10https://gerrit.wikimedia.org/r/519486 (https://phabricator.wikimedia.org/T226268) (owner: 10Ottomata) [18:59:24] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10RobH) I've put in T226756, in the future, please followup with me directly on orders (or file #hardware-requests or #procurement tasks). Assigning a random task... [19:00:05] longma: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - American version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T1900). [19:01:52] twentyafterfour: can we get a 10, 15 min delay on the train? 
we'd need to do a rb deploy [19:01:52] (03PS1) 10Jeena Huneidi: all wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519489 [19:01:55] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519489 (owner: 10Jeena Huneidi) [19:02:01] uh oh [19:02:14] what should I do now? [19:02:41] longma: You can just remove +2s [19:02:48] I just went ahead and did that for you [19:02:54] oh thanks [19:03:05] ah right you're the train conductor this week longma [19:04:14] so when mobrovac is done, I'll just add back the +2 and continue, right? [19:04:21] correct [19:04:24] thnx longma [19:04:28] appreciate it [19:04:32] thank you. I'm hitting the button. [19:04:51] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855 [19:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:58] T220855: Split the RESTBase execution paths - https://phabricator.wikimedia.org/T220855 [19:08:21] you're welcome mobrovac Pchelolo .
I'll wait for you to say the coast is clear [19:12:57] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:15:32] (03PS1) 10Hoo man: Wikidata dumps: Update minimum expected sizes [puppet] - 10https://gerrit.wikimedia.org/r/519493 (https://phabricator.wikimedia.org/T226601) [19:15:34] (03PS1) 10Hoo man: dumpwikidatajson: Fix error code detection [puppet] - 10https://gerrit.wikimedia.org/r/519494 (https://phabricator.wikimedia.org/T226601) [19:16:12] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855 (duration: 11m 21s) [19:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:17] T220855: Split the RESTBase execution paths - https://phabricator.wikimedia.org/T220855 [19:16:48] and here we go longma [19:17:06] thank you again for giving us this 15 minute delay [19:17:11] it's all smooth and nice [19:18:14] thanks, I'll continue with the train now then [19:20:45] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519489 (owner: 10Jeena Huneidi) [19:21:39] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519489 (owner: 10Jeena Huneidi) [19:21:57] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519489 (owner: 10Jeena Huneidi) [19:22:29] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:23:43] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.11 refs T220736 [19:23:45] !log 
run namespaceDupes.php for wikis in P8674 (T173070) [19:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:48] T220736: 1.34.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T220736 [19:24:53] Urbanecm: Failed to log message to wiki. Somebody should check the error logs. [19:24:55] T173070: Set default aliases for Project_talk namespace - https://phabricator.wikimedia.org/T173070 [19:25:55] stashbot, thanks, added manually :) [19:25:55] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [19:28:19] (03PS1) 10Ladsgroup: grafana: Make the wikimedia logo white [puppet] - 10https://gerrit.wikimedia.org/r/519495 [19:29:21] PROBLEM - Check size of conntrack table on kubernetes1001 is CRITICAL: CRITICAL: nf_conntrack is 99 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:29:53] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 50.63, 24.41, 16.50 [19:30:09] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 68.90, 32.43, 20.06 [19:30:23] PROBLEM - High CPU load on API appserver on mw1281 is CRITICAL: CRITICAL - load average: 67.00, 31.74, 19.89 [19:30:25] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 57.73, 29.62, 17.58 [19:30:27] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 59.90, 30.03, 18.18 [19:30:31] PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 64.05, 31.07, 19.37 [19:30:32] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 76.20, 39.19, 22.90 [19:30:39] PROBLEM - High CPU load on API appserver on mw1313 is CRITICAL: CRITICAL - load average: 72.14, 35.93, 23.07 [19:30:39] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 64.47, 33.95, 19.67 [19:30:41] PROBLEM - High CPU load on API appserver on 
mw1233 is CRITICAL: CRITICAL - load average: 64.55, 35.01, 21.19 [19:30:43] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 64.65, 31.39, 19.58 [19:30:43] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 64.96, 34.35, 21.04 [19:30:47] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 53.39, 29.50, 17.54 [19:30:49] PROBLEM - High CPU load on API appserver on mw1224 is CRITICAL: CRITICAL - load average: 54.50, 26.57, 15.39 [19:30:51] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 54.15, 29.85, 17.53 [19:31:21] PROBLEM - High CPU load on API appserver on mw1315 is CRITICAL: CRITICAL - load average: 72.71, 41.95, 26.14 [19:31:21] PROBLEM - High CPU load on API appserver on mw1344 is CRITICAL: CRITICAL - load average: 78.22, 41.22, 25.36 [19:31:30] Ouch. Train issue? [19:31:35] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 85.32, 47.43, 27.82 [19:31:40] not sure [19:31:49] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 92 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:31:49] RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 33.28, 30.67, 20.68 [19:31:57] RECOVERY - High CPU load on API appserver on mw1284 is OK: OK - load average: 30.62, 29.75, 20.07 [19:32:05] RECOVERY - High CPU load on API appserver on mw1313 is OK: OK - load average: 38.46, 34.42, 23.70 [19:32:09] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 32.95, 29.90, 20.13 [19:32:09] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 28.94, 30.08, 20.74 [19:32:19] PROBLEM - Nginx local proxy to apache on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:28] this happened on 
tuesday also [19:32:38] Oh, dying back down. [19:32:42] so it wasn't related to the "run NamespaceDupes.php" part then? [19:33:07] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:33:29] PROBLEM - High CPU load on API appserver on mw1289 is CRITICAL: CRITICAL - load average: 66.59, 37.98, 23.70 [19:33:33] PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:33:41] RECOVERY - Nginx local proxy to apache on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.796 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:33:45] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 59.91, 39.13, 22.93 [19:33:50] mutante: Running namespaceDupes was a SWAT thing, right? [19:34:05] And that's a maintenance script with code in it to back off if it's going to overload the site anyway. [19:34:21] It wouldn't update the API servers either [19:34:24] DB servers, maybe [19:34:28] Is there something I should look at to determine:? [19:34:29] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 74913 bytes in 2.276 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:34:32] s/update/upset/ [19:34:49] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:11] RECOVERY - High CPU load on API appserver on mw1224 is OK: OK - load average: 14.39, 22.75, 16.93 [19:35:55] James_F: yes, and ok. 
then "just" the issue that has happened before during deployment [19:35:59] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 19.10, 28.84, 23.22 [19:36:09] longma: it seems to be over already [19:36:23] RECOVERY - High CPU load on API appserver on mw1289 is OK: OK - load average: 22.39, 31.36, 23.67 [19:36:24] yeah [19:36:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:36:35] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 15.75, 23.94, 19.35 [19:37:09] RECOVERY - High CPU load on API appserver on mw1344 is OK: OK - load average: 21.27, 34.70, 28.77 [19:37:22] RECOVERY - High CPU load on API appserver on mw1347 is OK: OK - load average: 20.04, 34.09, 29.06 [19:37:22] thcipriani just mentioned it's normal for this to happen [19:37:41] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 17.94, 24.29, 20.10 [19:37:41] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 19.97, 25.60, 21.22 [19:37:47] unfortunately, it seems that when deploying train this happens fairly regularly :\ [19:38:10] that is, hhvm will use up a ton of resources for a short time on each appserver [19:38:35] RECOVERY - High CPU load on API appserver on mw1315 is OK: OK - load average: 23.23, 37.96, 32.71 [19:39:05] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:39:17] hhvm sucks [19:40:59] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 18.98, 24.67, 22.07 [19:41:55] PROBLEM - Apache HTTP on mw1275 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [19:43:21] RECOVERY - Apache HTTP on mw1275 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:44:17] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:45:11] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:46:15] RECOVERY - Check size of conntrack table on kubernetes1002 is OK: OK: nf_conntrack is 76 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:47:30] longma: is the train done? When is it going to be done? [19:47:59] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 16.06, 20.82, 23.76 [19:48:18] Amir1: Should be done [19:48:35] Mostly just watching HHVM noise improve again [19:48:35] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 12.48, 17.72, 23.99 [19:49:11] Reedy: Okay, I want to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/519492 to reduce size of jobqueue by 30% [19:49:32] (I can deploy it in evening SWAT but it's 1am here) [19:49:57] Yeah, just need to wait for longma to respond you're ok to deploy me thinks [19:50:16] sorry Amir1 , yeah train is done [19:50:59] PROBLEM - Check size of conntrack table on kubernetes1001 is CRITICAL: CRITICAL: nf_conntrack is 92 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:51:15] awesome! 
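[editor's note] The recurring `Check size of conntrack table` alerts above (nf_conntrack at 90-99 % full on kubernetes1001/1002) compare the kernel's live connection-tracking count against its configured maximum. A minimal sketch of such a check, assuming the standard Linux sysctl paths `/proc/sys/net/netfilter/nf_conntrack_count` and `nf_conntrack_max`; the warn/crit thresholds here are illustrative, not WMF's exact configuration:

```python
# Minimal sketch of a conntrack-fullness check in the spirit of the
# "Check size of conntrack table" alerts above. The /proc paths are the
# standard Linux netfilter counters; the real check's options may differ.

CONNTRACK_COUNT = "/proc/sys/net/netfilter/nf_conntrack_count"
CONNTRACK_MAX = "/proc/sys/net/netfilter/nf_conntrack_max"


def read_proc_int(path: str) -> int:
    """Read a single integer value from a /proc sysctl file."""
    with open(path) as f:
        return int(f.read().strip())


def conntrack_status(count: int, maximum: int,
                     warn: int = 80, crit: int = 90):
    """Return (nagios-style state, percent full) for the conntrack table."""
    pct = count * 100 // maximum
    if pct >= crit:
        return "CRITICAL", pct
    if pct >= warn:
        return "WARNING", pct
    return "OK", pct
```

On a host you would feed `conntrack_status(read_proc_int(CONNTRACK_COUNT), read_proc_int(CONNTRACK_MAX))` into the monitoring output; when the table actually fills, the kernel starts dropping new connections, which is why this alerts well before 100 %.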
[19:53:01] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:55:19] PROBLEM - Check size of conntrack table on kubernetes1001 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:56:37] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 17.17, 17.76, 23.61 [19:58:28] (03CR) 10ArielGlenn: [C: 03+1] "Looks good to me if someone wants to double check the numbers" [puppet] - 10https://gerrit.wikimedia.org/r/519493 (https://phabricator.wikimedia.org/T226601) (owner: 10Hoo man) [19:59:11] PROBLEM - SSH on bast3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:59:18] (03PS2) 10Andrew Bogott: nova-fullstack: add a watch on the number of leaked VMs. [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) [19:59:43] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: add a watch on the number of leaked VMs. [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [20:02:47] (03PS2) 10Bstorm: toolforge: k8s: add basic kubeadm infra [puppet] - 10https://gerrit.wikimedia.org/r/519480 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [20:03:59] PROBLEM - Check size of conntrack table on kubernetes1001 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:06:19] RECOVERY - SSH on bast3002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:07:35] (03PS3) 10Andrew Bogott: nova-fullstack: add a watch on the number of leaked VMs. 
[puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) [20:07:45] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:07:53] PROBLEM - Prometheus bast3002/ops restarted: beware possible monitoring artifacts on bast3002 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [20:08:00] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: add a watch on the number of leaked VMs. [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [20:09:01] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:10:24] (03CR) 10Bstorm: [C: 03+2] toolforge: k8s: add basic kubeadm infra [puppet] - 10https://gerrit.wikimedia.org/r/519480 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [20:10:57] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 20.08, 20.47, 23.75 [20:13:48] (03PS1) 10DannyS712: Clean up wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) [20:14:21] PROBLEM - Prometheus prometheus1003/global restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [20:14:27] (03PS4) 10Andrew Bogott: nova-fullstack: add a watch on the number of leaked VMs. 
[puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) [20:14:37] PROBLEM - Prometheus prometheus1004/global restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [20:14:41] mutante: Clearly we should move to php72 faster! ;-( [20:14:53] PROBLEM - Prometheus prometheus2003/global restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [20:14:58] Amir1: Have you deployed? [20:15:11] PROBLEM - Prometheus prometheus2004/global restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [20:15:15] James_F: looks like prometheus is restarting on everything [20:15:24] eh, i mean on prometheus* and bast* [20:15:32] and yes @ 7.2 [20:15:35] Fun. [20:15:47] On it [20:16:08] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: add a watch on the number of leaked VMs. [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [20:16:16] (03PS5) 10Andrew Bogott: nova-fullstack: add a watch on the number of leaked VMs. [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) [20:18:49] !log ladsgroup@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/Wikibase: [[gerrit:519492|Avoid inserting a new addUsage job when the current usage stays untouched (duration: 01m 14s) [20:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:57] James_F: I'm done, I'm continuing to monitor things [20:20:02] Cool. 
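[editor's note] The sync above deploys gerrit:519492, which cuts job-queue volume by skipping `addUsage` jobs when the tracked usage has not changed. A hypothetical sketch of that "don't enqueue no-op jobs" pattern; the class, queue, and aspect names are invented for illustration, not Wikibase's actual code:

```python
# Hypothetical sketch of the idea behind gerrit:519492: avoid inserting a new
# addUsage job when the current usage stays untouched. Names are illustrative.

class UsageTracker:
    def __init__(self, queue):
        self.queue = queue        # plain list standing in for the job queue
        self.current = {}         # entity id -> set of used aspects

    def add_usages(self, entity_id, aspects):
        """Enqueue an update job only if the usage set actually changed."""
        new = set(aspects)
        if self.current.get(entity_id) == new:
            return False          # unchanged: no job enqueued
        self.current[entity_id] = new
        self.queue.append(("addUsage", entity_id, sorted(new)))
        return True
```

Since most page re-parses touch the same entity aspects as before, comparing against the stored set before enqueueing is enough to drop a large share of jobs, consistent with the ~30% queue reduction mentioned below.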
[20:21:42] (03PS2) 10DannyS712: Clean up wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) [20:22:13] (03Restored) 10SBassett: Add rate limiter to Special:ConfirmEmail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: 10SBassett) [20:22:19] (03PS3) 10SBassett: Add rate limiter to Special:ConfirmEmail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [20:24:07] (03PS1) 10Andrew Bogott: nova-fullstack monitoring: fix a misnamed file [puppet] - 10https://gerrit.wikimedia.org/r/519513 (https://phabricator.wikimedia.org/T226647) [20:25:35] (03PS4) 10SBassett: Add rate limiter to Special:ConfirmEmail - config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [20:25:42] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack monitoring: fix a misnamed file [puppet] - 10https://gerrit.wikimedia.org/r/519513 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [20:25:44] (03PS5) 10SBassett: Add rate limiter to Special:ConfirmEmail - core change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [20:26:37] PROBLEM - puppet last run on cloudcontrol1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/check_nova_fullstack_leaks.py] [20:26:59] (03CR) 10jerkins-bot: [V: 04-1] Add rate limiter to Special:ConfirmEmail - core change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: 10SBassett) [20:27:01] (03PS6) 10SBassett: Add rate limiter to Special:ConfirmEmail - config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [20:31:06] _joe_: https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?panelId=1&fullscreen&orgId=1&from=now-1h&to=now [20:31:07] (03PS3) 10DannyS712: Clean up wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) [20:31:19] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/MobileFrontend/resources/dist: T221191: Log editor switches to visualeditorfeatureuse (duration: 00m 50s) [20:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:24] T221191: VE mobile default: start tracking editor switches on mobile + desktop - https://phabricator.wikimedia.org/T221191 [20:31:58] <_joe_> Amir1: wow that's amazing [20:32:03] RECOVERY - puppet last run on cloudcontrol1003 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:32:45] YESS \o/ [20:33:10] (03PS4) 10DannyS712: Clean up wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) [20:34:48] (03CR) 10DannyS712: "1 question, otherwise should be ready to go" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) (owner: 10DannyS712) [20:36:01] RECOVERY - Prometheus prometheus1003/global restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [20:36:21] RECOVERY - Prometheus prometheus1004/global restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [20:36:37] RECOVERY - Prometheus prometheus2003/global restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [20:36:51] RECOVERY - Prometheus bast3002/ops restarted: beware possible monitoring artifacts on bast3002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [20:36:57] RECOVERY - Prometheus prometheus2004/global restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [20:39:06] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (10MarcoAurelio) @Qgil Thanks for your comment. As a user of some mailing lists, I am still interested in upgrading to Mailman 3+. At T52864#5022944 w...
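[editor's note] The `Add rate limiter to Special:ConfirmEmail` patches above are a mediawiki-config change (presumably a `$wgRateLimits` entry), not new code. As a hedged illustration of what such a limit enforces, here is a minimal fixed-window rate limiter; the class, keys, and thresholds are invented for this sketch:

```python
# Illustrative fixed-window rate limiter, in the spirit of the
# Special:ConfirmEmail rate-limit change above (the actual patch is a
# $wgRateLimits config entry in mediawiki-config, not this code).

class FixedWindowLimiter:
    def __init__(self, max_hits, window_seconds):
        self.max_hits = max_hits
        self.window = window_seconds
        self.counters = {}        # key -> (window start time, hit count)

    def allow(self, key, now):
        """Return True if `key` may perform the action at time `now`."""
        start, hits = self.counters.get(key, (now, 0))
        if now - start >= self.window:    # window expired: start a new one
            start, hits = now, 0
        if hits >= self.max_hits:
            self.counters[key] = (start, hits)
            return False                  # over the limit for this window
        self.counters[key] = (start, hits + 1)
        return True
```

MediaWiki keys such limits per user or per IP; the point of adding one to a confirmation-email sender is to stop a single client from flooding the mail queue.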
[20:40:07] RECOVERY - Check size of conntrack table on kubernetes1001 is OK: OK: nf_conntrack is 78 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:42:20] 10Operations, 10observability: consider running bastion Prometheis inside cgroups - https://phabricator.wikimedia.org/T226769 (10CDanis) [20:43:27] mutante: James_F: re: the prometheus alerts for the 'global' prometheus -- that alert actually comes from Prometheus's internal uptime metric, which is exported by each Prometheus instance to the global ones -- so when one goes, all the globals will alert as well [20:43:46] the documentation for the alert states such, but it's still confusing, and also I added this alert before we supported a notes_url on a check_prometheus alert [20:43:48] which I should now fix :) [20:45:30] Ha. [20:47:03] argh, wait, that's still pending [20:47:30] (03CR) 10CDanis: "What's blocking this? I'd love to start adding notes urls to check_prometheus rules" [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [20:47:46] cdanis: aha! thanks much, also for the notes_url [20:48:10] okay once I can, I'll add a link to https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted for that alert [20:48:43] perhaps I can clarify it for global proms [20:50:30] cool! 
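[editor's note] cdanis's point above is worth modeling: each Prometheus exports its own uptime metric, the global instances scrape (federate) everyone's metrics, so one restarted site-local instance trips the "restarted" alert on every global at once. A toy model of that fan-out; instance names match the log, but the threshold and uptime numbers are invented:

```python
# Toy model of the alert fan-out cdanis describes: each Prometheus exports its
# own uptime metric, globals federate everyone's metrics, so a single restart
# fires the alert on every global instance. Threshold and uptimes are invented.

RESTART_THRESHOLD = 600  # flag any instance whose uptime is under 10 minutes


def restarted_instances(uptimes_by_instance, threshold=RESTART_THRESHOLD):
    """Instances whose exported uptime (seconds) indicates a recent restart."""
    return sorted(i for i, up in uptimes_by_instance.items() if up < threshold)


def global_alerts(federated_views, threshold=RESTART_THRESHOLD):
    """For each global Prometheus, which federated instances look restarted."""
    return {g: restarted_instances(view, threshold)
            for g, view in federated_views.items()}


# One site-local Prometheus (bast3002/ops) restarts; both globals see it,
# which is exactly the burst of prometheus*/global alerts in the log above.
views = {
    "prometheus1003/global": {"bast3002/ops": 120, "prometheus1004": 90000},
    "prometheus2003/global": {"bast3002/ops": 120, "prometheus1004": 90000},
}
```

This is why the fix discussed below is a clearer alert description plus a notes_url, rather than suppressing the duplicates: the globals really are all observing the same restarted instance.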
[20:50:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:50:57] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:50:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:51:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:51:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:51:42] cdanis: only thing blocking https://gerrit.wikimedia.org/r/509365 is review. 
i have tried to add urls that look sane or we could default to a landing page which says please create a page for this [20:51:43] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:51:56] but i guess that can wait :( [20:52:00] no it's okay [20:52:05] not urgent, just curious [20:52:06] (03PS1) 10CDanis: prometheus restarts: clarify alert for global proms [puppet] - 10https://gerrit.wikimedia.org/r/519517 [20:52:16] jbond42|away: I'll take a pass tonight or tomorrow [20:52:28] ack cheers [20:52:33] (03CR) 10jerkins-bot: [V: 04-1] prometheus restarts: clarify alert for global proms [puppet] - 10https://gerrit.wikimedia.org/r/519517 (owner: 10CDanis) [20:52:35] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [20:52:42] are these ^^ errors something to worry about as I'm online? [20:52:53] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.5402 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:53:40] going to guess one of our usual 503 spikes [20:53:50] yeah, looks like it, and already resolved [20:54:03] ok thanks chris [20:54:23] cp1075 had a stomachache and caused about 90k 503s [20:55:27] ack [20:56:13] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [20:56:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:56:57] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [20:56:59] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:57:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:57:15] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:57:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:57:43] (03PS2) 10CDanis: prometheus restarts: clarify alert for global proms [puppet] - 10https://gerrit.wikimedia.org/r/519517 [20:58:53] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:00:11] (03PS3) 10CDanis: prometheus restarts: clarify alert for global proms [puppet] - 10https://gerrit.wikimedia.org/r/519517 [21:01:43] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [21:01:49] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [21:02:21] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [21:02:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:02:47] (03CR) 10CDanis: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/17162/" [puppet] - 10https://gerrit.wikimedia.org/r/519517 (owner: 10CDanis) [21:02:57] RECOVERY - Eqiad 
HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [21:03:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:07:43] i have 500's again on commons page [21:07:58] * Connection state changed (MAX_CONCURRENT_STREAMS updated)! [21:07:58] < HTTP/2 500 [21:07:58] < date: Thu, 27 Jun 2019 21:07:30 GMT [21:07:58] < server: Varnish [21:07:58] < [21:08:00] * Connection #0 to host commons.m.wikimedia.org left intact [21:08:13] blank white page. like the http/2 connection just vanished [21:08:33] from .nl [21:09:17] thedj: we just had a short spike .. "cp1075 had a stomachache and caused about 90k 503s" [21:09:31] 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Development services), and 3 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10mmodell) Thanks @MoritzMuehlenhoff ! I really appreciate it! I'll insta... [21:09:40] https://phabricator.wikimedia.org/P8679 [21:10:56] mutante: :D [21:11:15] ah, that's different thedj [21:11:54] seems pretty consistent... 
[21:12:45] thedj: could be https://phabricator.wikimedia.org/T209590 [21:12:58] i notice " Using HTTP2" [21:13:34] the "MAX_CONCURRENT_STREAMS" string appears in both [21:14:08] hauskatze: hello [21:14:36] hmmm [21:14:40] thedj: that's really interesting [21:15:05] have it on http 1.1 as well [21:15:15] if i force the client to 1.1 [21:15:40] and it's on the mobile page, but not the standard commons page [21:19:19] what's weird is that it seems to die on the varnish layer, because otherwise i'd see the varnish error page right ? [21:19:33] yeah, you're also missing a bunch of response headers I'd expect even on a varnish error [21:19:53] like X-Cache should still be there [21:20:23] I'll collect some logs and file a ticket [21:20:33] so.. tls termination or something ? [21:20:42] cdanis: thx [21:21:00] I don't think so, although I don't know for sure; I suspect either weird internal varnish error, or a bug in our VCL [21:22:16] and only when logged in.. [21:22:37] yeah, I just noticed that [21:22:47] if I omit your Cookie header from the curl command line, everything is fine [21:23:20] (03PS1) 10Ppchelko: Remove references to pdfrender from RESTBase. 
[puppet] - 10https://gerrit.wikimedia.org/r/519526 (https://phabricator.wikimedia.org/T226675) [21:24:53] (03CR) 10Ppchelko: "Step 1 from https://gerrit.wikimedia.org/r/c/operations/puppet/+/514226#message-10d3c9fb179f10e94b7b35bc1a308b438be972af" [puppet] - 10https://gerrit.wikimedia.org/r/519526 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [21:25:27] (03PS1) 10Bstorm: toolforge: start the configuration yaml for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) [21:25:50] (03CR) 10jerkins-bot: [V: 04-1] toolforge: start the configuration yaml for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [21:35:20] (03PS2) 10Bstorm: toolforge: start the configuration yaml for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) [21:35:44] (03CR) 10jerkins-bot: [V: 04-1] toolforge: start the configuration yaml for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [21:35:56] brion: got 1.5h before the next deploy window, so no worries :) [21:36:28] Krinkle: shall i wait for that window then? :) [21:36:36] * brion is in new york so time zones are confusing [21:37:47] 10Operations, 10SRE-Access-Requests: please re-activate LDAP access for Dzahn - https://phabricator.wikimedia.org/T226744 (10Dzahn) Thanks, confirmed working Wikitech (with new password after reset) and then Gerrit and Phabricator after i got separately unblocked there as well. [21:39:06] 10Operations, 10Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (10CDanis) [21:39:43] brion: If you'd like one of the SWAT-deployers to roll it out instead, you can add it to the wiki page for that window instead. [21:39:53] Or as deployer, could roll it out now. 
[21:40:34] I meant that we have 1.5 before someone else wants to deploy something :) [21:41:18] cool, ok :D [21:41:24] yeah as soon as it merges i'm good [21:42:11] aaaand there it is [21:42:19] (03PS3) 10Bstorm: toolforge: start the configuration yaml for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) [21:43:21] ok it at least doesn't break production web views, but that's not surprising ;) [21:43:48] !mwlog deploying fix for TMH jobqueue bug T226748 [21:43:49] T226748: WebVideoTranscodeJob fatal: Call to getStdout() on a non-object - https://phabricator.wikimedia.org/T226748 [21:44:02] hmm did i misremember that loggy thing [21:44:08] !log deploying fix for TMH jobqueue bug T226748 [21:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:14] yay [21:44:19] brion: check with Reedy btw.. https://phabricator.wikimedia.org/T226713 [21:44:19] :D [21:44:58] brion: i see he was also looking at running requeueTranscodes [21:44:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:45:34] thedj: that probably explains why we saw multiple fatals in production, something failed out during that and barfed :D [21:45:44] brion: likely [21:45:59] ok i won't re-run it then, until we have time to figure out what the files were failing on [21:46:21] it's still running [21:46:28] it's only on Fo files [21:47:08] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:47:41] Reedy: yeah later i should add better filter 
options so we can re-transcode a single file type rather than a whole media type [21:47:49] brion: I filed a bug for that :P [21:47:51] :D [21:47:59] https://phabricator.wikimedia.org/T226718 [21:48:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:48:45] great, i won't do that immediately as i'm technically on vacation today and tomorrow ;) but can fix it up next week. remind me if i don't poke it [21:48:59] should be pretty easy [21:49:31] ok it claims to be done [21:49:31] (03PS1) 10Andrew Bogott: nova-fullstack: fix name of nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/519531 (https://phabricator.wikimedia.org/T226647) [21:50:09] yeah, exactly. Just needs a param for the other two columns in the index [21:51:04] Krinkle: thx for the bug report :D i'll check the logs after more files have run to see if there's any commonality to the transcode errors that were triggering the fatals [21:51:16] Reedy: and thanks for running the batch script :D [21:51:34] which reminds me to open a ticket on that piece of software that it should use exitcodes... [21:51:40] thedj: and thanks for being awesome! [21:51:50] brion: np. go vacation you. [21:51:54] :D [21:52:04] :) [21:52:05] achievement unlocked: deployed from a hotel bar [21:52:13] gods :P [21:52:16] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: fix name of nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/519531 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [21:52:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:53:10] cdanis: and it works .. probably this then ^^^ [21:53:36] oh, so it was an app layer problem? interesting. 
still not sure what's up with the missing headers [21:53:45] or.. it was my old cookie... [21:54:01] that's the only other thing that changed since the last refresh [21:54:38] (03PS2) 1020after4: phabricator: Allow admins to silence maniphest bulk jobs via sudo [puppet] - 10https://gerrit.wikimedia.org/r/517140 [21:54:40] (03PS1) 1020after4: phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) [21:55:35] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [21:55:35] ok signing off for now, but do ping me if anything explodes unexpectedly, i won't go far. ;) [21:55:45] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [21:56:21] 10Operations, 10Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (10CDanis) So, it looks like that this 500 did in fact come from the application layer... but shouldn't we still be getting more response headers from the edge? 
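The triage heuristic cdanis is applying above — a 500 that really traversed the cache layer should still carry edge headers like X-Cache, while thedj's paste had only date: and server: — can be sketched as a tiny helper. Header names come from the curl paste and the discussion; the classification rule itself is a hypothetical sketch, not anything running in production monitoring:

```shell
# Sketch of the header-based triage discussed above: a response that
# traversed the edge cache normally should still carry X-Cache, even
# on an error. Hypothetical helper; rule approximated from the chat.
classify_500() {
    # $1: file containing "name: value" response headers, one per line
    if grep -qi '^x-cache:' "$1"; then
        echo "has edge cache headers: error likely app-layer"
    elif grep -qi '^server: *varnish' "$1"; then
        echo "bare Varnish response: generated or truncated at the edge"
    else
        echo "origin unclear"
    fi
}

# The paste from thedj had only date: and server: headers:
hdrs=$(mktemp)
printf 'date: Thu, 27 Jun 2019 21:07:30 GMT\nserver: Varnish\n' > "$hdrs"
classify_500 "$hdrs"
rm -f "$hdrs"
```

As the thread shows, the heuristic can mislead: this 500 turned out to come from the application layer even though the usual edge headers were missing.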
[21:56:23] (03PS2) 1020after4: phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) [21:56:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:56:40] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:57:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:57:17] sigh [21:57:17] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [21:57:22] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:57:22] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:57:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:57:36] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:57:44] 
PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [21:58:15] !log cdanis@cp1075.eqiad.wmnet ~ % sudo -i varnish-backend-restart [21:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:05] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:59:32] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [21:59:45] 10Operations, 10observability: consider running bastion Prometheis inside cgroups - https://phabricator.wikimedia.org/T226769 (10CDanis) 05Open→03Invalid I'm told the plan is to move these onto Ganeti in PoPs, so that seems just as good. 
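The recurring graphite1004 pages follow one rule: an alert goes CRITICAL when at least some percentage of recent 5xx req/min datapoints exceed a threshold. A rough shell sketch of that logic (the real check is an Icinga graphite plugin; thresholds are copied from the alert text above, while the function name, window handling, and OK message format are approximations):

```shell
# Rough sketch of the graphite alert logic seen above: CRITICAL when
# >= crit_pct percent of datapoints exceed the critical threshold.
# Thresholds copied from the alert text; everything else approximated.
check_5xx() {
    crit=$1 crit_pct=$2; shift 2
    printf '%s\n' "$@" | awk -v crit="$crit" -v pct="$crit_pct" '
        $1 > crit { above++ }
        { n++ }
        END {
            p = 100 * above / n
            if (p >= pct)
                printf "CRITICAL: %.2f%% of data above the critical threshold [%s]\n", p, crit
            else
                printf "OK: %.2f%% of data above the threshold [%s]\n", p, crit
        }'
}

# Ten samples, two above 1000 req/min -> 20.00%, matching the
# "20.00% of data above the critical threshold [1000.0]" pages.
check_5xx 1000.0 10 120 90 1500 2200 80 60 100 95 70 85
```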
[22:00:06] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [22:00:29] 10Operations, 10observability: consider running bastion Prometheus inside cgroups - https://phabricator.wikimedia.org/T226769 (10faidon) [22:01:05] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.3982 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [22:01:12] (03PS2) 10Gilles: Serve JPG when WEBP conversion fails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707) [22:01:15] (03CR) 10Gilles: Serve JPG when WEBP conversion fails (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707) (owner: 10Gilles) [22:02:50] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:03:15] 10Operations, 10observability: consider running bastion Prometheis inside cgroups - https://phabricator.wikimedia.org/T226769 (10faidon) [22:03:36] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:03:38] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [22:03:50] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:04:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:04:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:05:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:05:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:05:28] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:06:22] 10Operations, 10ops-eqiad, 10DC-Ops: install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) p:05Triage→03Normal [22:06:43] 10Operations, 10Release Pipeline, 10serviceops, 10Core Platform Team (RESTBase Split (CDP2)), and 5 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10mobrovac) Regarding the deployment plan, the main pain point is that we will need to ha... 
[22:08:22] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [22:09:10] PROBLEM - puppet last run on phab1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 28 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php7.2-mysqlnd] [22:09:20] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [22:10:02] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [22:10:44] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [22:11:08] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [22:11:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:11:47] 10Operations, 10media-storage, 10serviceops, 10Patch-For-Review, 10User-jijiki: Swift object servers become 
briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10Gilles) At a glance on a given proxy the same object doesn't occur multiple times in a row. But the same desti... [22:12:51] (03CR) 10Urbanecm: [C: 03+1] "LGTM!" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) (owner: 10DannyS712) [22:13:13] (03PS3) 1020after4: phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) [22:14:21] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10RobH) [22:16:12] 10Operations, 10MediaWiki-Maintenance-scripts, 10Performance-Team: cron spam for slow queries on mwmaint /usr/local/bin/foreachwiki initSiteStats.php --update > /dev/null - https://phabricator.wikimedia.org/T216243 (10Krinkle) [22:22:28] (03CR) 10Urbanecm: [C: 04-2] "> This can wait until July 11." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [22:25:34] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 132.3 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [22:25:44] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10RobH) [22:26:40] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10RobH) [22:27:11] 10Operations, 10ops-eqiad, 10DC-Ops: install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [22:27:34] (03CR) 10Dzahn: [C: 03+1] phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [22:28:41] 10Operations, 10ops-eqiad, 10DC-Ops: 
install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [23:00:05] MaxSem, RoanKattouw, and Niharika: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T2300). [23:00:05] Amir1: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:13] I'll SWAT [23:00:21] (and I also have a patch the bot didn't notice [23:01:38] Amir1: Are you here for your SWAT? It's pretty late for you [23:12:36] (03CR) 1020after4: [C: 03+1] phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [23:14:17] REPORT OUTAGE: mediawiki changestream is currently down. (https://stream.wikimedia.org/v2/stream/recentchange) https://www.irccloud.com/pastebin/VdGZnjZz/ [23:15:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:20:12] 14:44:09 !log deploying fix for TMH jobqueue bug T226748 [23:20:13] T226748: WebVideoTranscodeJob fatal: Call to getStdout() on a non-object - https://phabricator.wikimedia.org/T226748 [23:20:41] brion: Did you deploy this to a different server or something?
It isn't on deploy1001, I just pulled something else on there and this came with it [23:20:56] RoanKattouw: hmm, lemme double-check [23:21:46] Current state of deploy1001: I've run git pull (which pulled in the wmf.11 commit updating the submodule) but not git submodule update extensions/TMH (which would check out that commit in TMH) [23:22:06] RoanKattouw: it was deploy1001 yeah [23:22:17] Then what/how did you deploy it? [23:22:18] ah i best i forgot a step [23:22:19] *bet [23:22:37] I don't see any syncs in the log, and it came riding in with my git pull so you couldn't have pulled it on there either [23:22:45] Perhaps you ran git pull in the config repo or the wmf.10 checkout? [23:22:58] yep, i completely failed to pull in the patch :D [23:23:05] * brion slaps self [23:23:07] In any case, if you're around now, I can deploy it for you now and you can see if it worked :) [23:23:11] great :D [23:23:42] It's on mwdebug1002, in case it's testable there (job queue stuff isn't always) [23:24:33] RoanKattouw: yeh no way to test it from mwdebug1002 [23:24:44] that's what i hate about deploying these job queue changes, they're very hard to test currently [23:25:04] just gotta wait until it triggers another fatal in the background jobs (or fails to do so) [23:25:48] !log roan is fixing deploy of T226748 which failed to include the patch (whoops) [23:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:54] T226748: WebVideoTranscodeJob fatal: Call to getStdout() on a non-object - https://phabricator.wikimedia.org/T226748 [23:26:40] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/GrowthExperiments/includes/HomepageHooks.php: Fix JS error on Special:Homepage (duration: 00m 50s) [23:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:55] 10Operations, 10Mail, 10Phabricator, 10Patch-For-Review, and 2 others: Phabricator email comments not posted - 
https://phabricator.wikimedia.org/T224752 (10Dzahn) nice fix! sorry, i was off for quite some time and reading this now. indeed a good catch that we did not catch during the migration. thanks all! [23:28:56] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/TimedMediaHandler/: T226748 (duration: 00m 50s) [23:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:25] next deploy i'll be sure to follow all the steps more carefully :D [23:30:27] thanks roan! [23:31:41] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (10Qgil) To be clear, the proposal is that users of a Mailing mailing list about X could keep using the same email features subscribing in mailing list... [23:33:04] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10revi) Q: Does discourse support 'mailing list mode' with NO archives left after it is distributed? At least, that's how it works on [[https://lists.wikime... [23:43:04] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:44:22] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
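The failed deploy Roan diagnosed above — `git pull` ran in the superproject, but `git submodule update` did not, so the TMH fix never landed in the working tree — is easy to reproduce in throwaway repos. Repo and path names below are made up for the demo and do not reflect the real deploy1001 layout:

```shell
# Demo of the failure mode above: `git pull` in the superproject moves
# the *recorded* submodule commit, but the submodule working tree stays
# on the old commit until `git submodule update` is run.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@localhost
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@localhost
# Allow file:// submodule clones on newer git versions:
export GIT_CONFIG_COUNT=1 GIT_CONFIG_KEY_0=protocol.file.allow GIT_CONFIG_VALUE_0=always
tmp=$(mktemp -d)

# An "extension" repo with one commit, embedded in a "core" repo.
git init -q "$tmp/tmh" && git -C "$tmp/tmh" commit -q --allow-empty -m 'fix v1'
git init -q "$tmp/core" && git -C "$tmp/core" commit -q --allow-empty -m 'init core'
git -C "$tmp/core" submodule --quiet add "$tmp/tmh" extensions/TMH
git -C "$tmp/core" commit -q -m 'add TMH submodule'

# A deploy-style checkout of core, submodules included.
git clone -q --recurse-submodules "$tmp/core" "$tmp/deploy"

# A new fix lands in the extension and is recorded in core.
git -C "$tmp/tmh" commit -q --allow-empty -m 'fix v2'
git -C "$tmp/core/extensions/TMH" pull -q
git -C "$tmp/core" commit -qam 'bump TMH'

# Step 1 only -- what happened in the log: the submodule stays on v1.
git -C "$tmp/deploy" pull -q
before=$(git -C "$tmp/deploy/extensions/TMH" log -1 --format=%s)

# The forgotten step: now the submodule checkout moves to v2.
git -C "$tmp/deploy" submodule --quiet update
after=$(git -C "$tmp/deploy/extensions/TMH" log -1 --format=%s)

echo "after pull only:        $before"
echo "after submodule update: $after"
```

This is why the fix only reached production once the submodule was actually updated and synced (the `Synchronized php-1.34.0-wmf.11/extensions/TimedMediaHandler/` line above).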