[00:00:04] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T0000). Please do the needful.
[00:19:36] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10Varnent) @jbond - correct - that is the mailing list that we are referring to. :)
[00:23:55] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 27.76 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:24:45] 10Operations, 10Cassandra, 10Dependency-Tracking, 10Wikibase-Quality, and 7 others: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache - https://phabricator.wikimedia.org/T204024 (10Addshore) 05Open→03Stalled Stalled on the RFC
[00:25:23] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 75.95 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:40:53] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:40:53] PROBLEM - wiki content on commons on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/1118/
[00:40:53] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[00:40:53] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] hmm
[00:40:53] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[00:40:53] PROBLEM - https://phabricator.wikimedia.org on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
[00:40:53] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[00:40:53] phan and grafana are not loading for me
[00:40:53] *phab
[00:40:53] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 58.92 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:40:53] wikipedia is not loading either or is very slow
[00:40:53] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/71/
[00:40:53] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:40:53] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:40:53] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[00:40:53] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:40:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:40:53] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[00:47:21] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80%
[00:59:04] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 2.420 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[00:59:31] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:59:35] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[01:00:49] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:00:53] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:00:59] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[01:01:20] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:01:45] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[01:01:51] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:01:55] PROBLEM - https://phabricator.wikimedia.org on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
[01:01:57] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:01:57] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:01:57] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:02:03] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:02:14] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:03:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 55.81 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:03:12] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[01:03:12] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[01:03:19] RECOVERY - https://phabricator.wikimedia.org on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 36523 bytes in 0.339 second response time https://wikitech.wikimedia.org/wiki/Phabricator
[01:03:20] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:03:20] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:03:23] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:03:23] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:03:23] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:03:31] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:06:09] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 100.8 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:06:19] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:07:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:07:33] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[01:07:38] PROBLEM - wiki content on commons on commons.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection. https://phabricator.wikimedia.org/project/view/1118/
[01:07:43] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:07:45] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:07:51] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[01:07:56] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:07:57] PROBLEM - https://phabricator.wikimedia.org on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
[01:07:59] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:07:59] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:07:59] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:03] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[01:08:05] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:06] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:06] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:19] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:19] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:08:41] wtf
[01:08:58] Prod is unwell.
[01:08:59] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[01:09:09] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:09:09] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:09:10] RECOVERY - wiki content on commons on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 162825 bytes in 1.035 second response time https://phabricator.wikimedia.org/project/view/1118/
[01:09:49] In case this is news to folks, might be worth noting that this does not seem to be wikimedia-specific. A LOT of things have gone down.
[01:09:53] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[01:09:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:09:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:09:58] PROBLEM - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:09:58] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:10:16] I'm just going to assume sharks ate the internet tubes and we're all doomed.
[01:10:47] network issues i think Isarra
[01:10:47] (03PS1) 10BBlack: emergency depool eqiad front edge [dns] - 10https://gerrit.wikimedia.org/r/519319
[01:10:47] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[01:10:49] RECOVERY - https://phabricator.wikimedia.org on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 36523 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Phabricator
[01:10:53] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:10:53] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:10:55] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:10:55] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:11:01] yes there's an ongoing issue being addressed, SRE is working on it
[01:11:21] (03CR) 10BBlack: [C: 03+2] emergency depool eqiad front edge [dns] - 10https://gerrit.wikimedia.org/r/519319 (owner: 10BBlack)
[01:11:37] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:11:54] !log depool eqiad front edge
[01:12:02] !log depool eqiad front edge (in DNS, I meant)
[01:12:39] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[01:12:54] RECOVERY - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 3.327 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:13:05] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid
[01:13:06] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:06] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:07] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:07] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:09] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:09] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:13:09] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[01:13:12] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:13:13] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[01:13:14] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80%
[01:13:15] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[01:13:17] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:20] PROBLEM - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:13:23] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[01:13:25] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:13:55] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:05] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:14:06] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:14:06] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:14:38] I'm getting timeouts and a weird ERR_SSL_VERSION_INTERFERENCE error trying to load logged in and anonymous pages. Works fine on my phone but not on a couple computers over here. Same issue on Firefox, Chrome, and Chromium.
[01:14:46] /mode #wikimedia-operations +o andrewbogott
[01:14:50] bah!
[01:14:57] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:14:57] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:14:57] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:15:05] Other websites seem to be fine.
[01:15:05] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:15:08] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:15:11] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[01:15:12] PROBLEM - Host text-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[01:15:15] niedzielski it's known, SRE are looking.
[01:15:43] andrewbogott: `/cs op #wikimedia-operations andrewbogott`. :-)
[01:15:51] But also sharks.
[01:16:03] :)
[01:16:05] thanks James_F
[01:16:14] thanks paladox !
[01:16:40] Thanks, andrewbogott.
[01:16:57] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:17:02] RECOVERY - Host text-lb.eqiad.wikimedia.org is UP: PING WARNING - Packet loss = 37%, RTA = 10.08 ms [01:17:20] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [01:17:23] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:23] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:26] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:17:27] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:17:27] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:17:29] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:35] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [01:17:35] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:35] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:35] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:35] RECOVERY - 
restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:36] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:37] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:39] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:17:39] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [01:17:40] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15746 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:17:43] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [01:17:53] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:17:55] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:17:59] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:18:11] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [01:19:21] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:19:21] RECOVERY - graphoid endpoints health 
on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [01:19:29] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:19:33] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:20:46] RECOVERY - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:20:59] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:21:16] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 36.14 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:21:30] PROBLEM - wiki content on commons on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/1118/ [01:22:20] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:22:36] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15758 bytes in 1.101 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:22:52] RECOVERY - wiki content on commons on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 162877 bytes in 0.271 second response time https://phabricator.wikimedia.org/project/view/1118/ [01:24:33] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:25:10] PROBLEM - LVS HTTP IPv4 on 
text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:25:14] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15745 bytes in 0.746 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:25:19] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw1257.eqiad.wmnet, mw1246.eqiad.wmnet, mw1238.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:25:30] surprisingly gerrit is the only service working for me. [01:26:57] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:27:35] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:28:31] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1242.eqiad.wmnet, mw1238.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [01:29:52] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:30:09] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [01:31:35] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/RESTBase [01:31:42] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 54%, RTA = 84.42 ms [01:32:04] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 58%, RTA = 84.62 ms [01:32:28] RECOVERY - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:33:51] PROBLEM - pybal on lvs1013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [01:34:01] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [01:34:13] PROBLEM - PyBal backends health check on lvs1013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [01:34:37] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [01:34:54] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ed1a::1) [01:35:19] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 2.321e+05 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [01:35:46] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:37:13] PROBLEM - PyBal connections to etcd on lvs1013 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [01:38:48] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15746 bytes in 2.788 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:38:48] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - 
CRITICAL - appservers-https_443: Servers mw1249.eqiad.wmnet, mw1254.eqiad.wmnet, mw1248.eqiad.wmnet, mw1252.eqiad.wmnet, mw1255.eqiad.wmnet, mw1246.eqiad.wmnet are marked down but pooled: apaches_80: Servers mw1241.eqiad.wmnet, mw1243.eqiad.wmnet, mw1256.eqiad.wmnet, mw1249.eqiad.wmnet, mw1246.eqiad.wmnet, mw1258.eqiad.wmnet, mw1239.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:39:02] PROBLEM - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:40:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:40:22] RECOVERY - LVS HTTP IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [01:41:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [01:41:42] (03PS1) 10BBlack: emergency depool esams front edge [dns] - 10https://gerrit.wikimedia.org/r/519320 [01:43:14] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% [01:43:15] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:43:30] (03CR) 10BBlack: [C: 03+2] emergency depool esams front edge [dns] - 10https://gerrit.wikimedia.org/r/519320 (owner: 10BBlack) [01:43:45] !log depool esams front edge [01:47:20] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [01:47:39] PROBLEM - pybal on lvs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [01:47:41] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [01:52:31] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [01:54:41] PROBLEM - pybal on lvs1013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [01:59:51] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 70.71 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:01:21] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 27 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:04:17] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 
is CRITICAL: 45.9 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:04:53] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 0.04016 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:05:44] PROBLEM - kvm ssl cert on cloudvirt1024 is CRITICAL: Certificate will expire https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:05:49] PROBLEM - kvm ssl cert on cloudvirt1018 is CRITICAL: Certificate will expire https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:07:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:07:15] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 79.08 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:07:19] (03CR) 10Catrope: [C: 03+1] Betalabs: Enable GrowthExperiments features for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) (owner: 10Kosta Harlan) [02:08:20] ACKNOWLEDGEMENT - kvm ssl cert on cloudvirt1018 is CRITICAL: Certificate will expire andrew bogott non-urgent but discussed in T225484 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:08:21] ACKNOWLEDGEMENT - kvm ssl cert on cloudvirt1024 is CRITICAL: Certificate will expire andrew bogott non-urgent but discussed in T225484 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:09:21] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 169.7 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:11:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text 
site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:12:29] !log lvs3001: powercycle, unresponsive console [02:13:13] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% [02:15:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:15:56] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 80%, RTA = 84.36 ms [02:15:57] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:16:07] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 50%, RTA = 84.32 ms [02:17:14] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [02:17:14] !log lvs3003: powercycle, unresponsive console [02:18:54] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.33 ms [02:19:08] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 0.519 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:19:31] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:20:09] RECOVERY - Host lvs3003 is UP: PING OK - Packet loss = 0%, RTA = 83.44 ms [02:24:15] RECOVERY - pybal on lvs1013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [02:24:27] RECOVERY - pybal on lvs1016 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [02:24:31] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [02:24:35] RECOVERY - PyBal backends health check on lvs1013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:25:19] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 67 connections established with conf1004.eqiad.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [02:26:25] RECOVERY - PyBal connections to etcd on lvs1013 is OK: OK: 4 connections established with conf1004.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [02:26:35] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 4 probes of 471 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:28:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:32:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% [02:39:45] (03PS1) 10BBlack: Revert "emergency depool esams front edge" [dns] - 10https://gerrit.wikimedia.org/r/519325 [02:39:48] (03PS1) 10BBlack: Revert "emergency depool eqiad front edge" [dns] - 10https://gerrit.wikimedia.org/r/519326 [02:40:27] !log re-pooling esams+eqiad front edge traffic [02:40:34] (03CR) 10BBlack: [C: 03+2] Revert "emergency depool esams front edge" [dns] - 10https://gerrit.wikimedia.org/r/519325 (owner: 10BBlack) [02:40:40] (03CR) 10BBlack: [C: 03+2] Revert "emergency depool eqiad front edge" [dns] - 10https://gerrit.wikimedia.org/r/519326 (owner: 10BBlack) [02:47:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:47:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:48:15] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:48:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:48:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:48:29] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:48:47] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [02:49:01] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:49:26] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:49:26] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:49:41] PROBLEM - HTTP availability for Nginx -SSL 
terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:49:51] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [02:50:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [02:51:07] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [02:51:09] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [02:51:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [02:52:01] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 49.36 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:52:14] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is 
working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [02:52:25] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:53:27] PROBLEM - https://phabricator.wikimedia.org on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator [02:53:29] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [02:53:36] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:53:49] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:54:47] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:54:49] RECOVERY - https://phabricator.wikimedia.org on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 36522 bytes in 1.484 second response time https://wikitech.wikimedia.org/wiki/Phabricator [02:54:51] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for 
April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedi [02:54:51] se [02:55:09] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:55:27] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:55:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [02:56:30] PROBLEM - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:56:33] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:35] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:56:38] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: 
HTTP/1.1 200 OK - 15809 bytes in 8.992 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:56:43] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:59] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:59] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:59] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:57:03] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:57:16] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 107 probes of 471 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:57:43] RECOVERY 
- restbase endpoints health on restbase1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:57:45] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:57:57] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [02:57:59] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:05] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:06] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:11] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [02:58:11] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:58:13] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:58:21] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:21] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:21] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:29] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [02:58:31] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:58:31] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:58:45] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 47.92 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:59:01] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 56 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:59:13] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [02:59:43] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:00:32] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2804 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:00:41] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 101 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:00:56] RECOVERY - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15848 bytes in 4.393 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:01:10] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:01:45] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 82.39 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:02:13] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [03:02:27] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2846 bytes in 0.455 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:02:34] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15811 bytes in 0.517 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:03:13] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% [03:03:36] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15849 bytes in 0.471 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:05:32] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15862 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:07:13] 04Critical Alert for device cr2-eqsin.wikimedia.org - Primary inbound port utilisation over 80% [03:08:21] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound 
port utilisation over 80% [03:09:59] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:10:05] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:10:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:10:41] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:10:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:10:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:10:53] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:11:21] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:11:29] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:11:33] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [03:11:37] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 21 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:11:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:15:01] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [03:15:37] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [03:16:21] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [03:16:29] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [03:16:41] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [03:16:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [03:17:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [03:17:41] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28236 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:18:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:25:05] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:26:13] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 74.07 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:28:01] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:28:42] (03PS1) 10BBlack: depool eqsin and eqiad front edges [dns] - 10https://gerrit.wikimedia.org/r/519334 [03:29:12] !log depooling eqsin + eqiad edges [03:29:31] (03CR) 10BBlack: [C: 03+2] depool eqsin and eqiad front edges [dns] - 10https://gerrit.wikimedia.org/r/519334 (owner: 10BBlack) [03:33:14] 
04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% [03:37:13] 04Critical Alert for device cr2-eqsin.wikimedia.org - Primary inbound port utilisation over 80% [03:38:20] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [03:38:59] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 42.82 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:40:15] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 31.82 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:40:46] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 117 probes of 471 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:47:25] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 59 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:52:20] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% [03:52:53] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 19 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:56:45] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [04:02:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 80% [04:02:37] RECOVERY - IPv4 ping to eqiad on 
ripe-atlas-eqiad is OK: OK - failed 3 probes of 471 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [04:03:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% [04:15:25] !log lvs1013: enable puppet + pybal [04:15:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:16:52] !log lvs1016: enable puppet + pybal [04:18:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [04:20:07] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 56.75 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:20:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:26:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [04:27:19] (03PS1) 10BBlack: Revert "depool eqsin and eqiad front edges" [dns] - 10https://gerrit.wikimedia.org/r/519342 [04:28:56] (03CR) 10BBlack: [C: 03+2] Revert "depool eqsin and eqiad front edges" [dns] - 10https://gerrit.wikimedia.org/r/519342 (owner: 10BBlack) [04:30:04] !log re-pooling eqsin+eqiad front edges [04:31:53] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 84.31 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:32:07] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 95.25 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:34:14] !log Start replication on labsdb1011 - T222978 [04:34:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:34:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:34:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:35:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: 
cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:36:05] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [04:36:15] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [04:36:19] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:36:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [04:36:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [04:37:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:37:47] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:38:18] !log Stop MySQL on dbstore1005 for upgrade - T226358 [04:38:33] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 55.21 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:38:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:38:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:39:01] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [04:39:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:39:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:40:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:40:35] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 53.45 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:40:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:40:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:40:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:40:41] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:40:51] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:41:05] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of 
data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [04:41:21] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:41:33] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:41:33] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:43:40] 10Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (10A2093064) [04:44:16] 10Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (10Marostegui) We are looking into general connectivity issues at the moment [04:44:23] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.3036 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [04:45:54] <[1997kB]> cp1087, Varnish XID 308445704 [04:45:55] <[1997kB]> Error: 503, Backend fetch failed at Thu, 27 Jun 2019 04:44:55 GMT [04:47:19] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [04:47:37] [1997kB]: we are looking into some connectivity issues at the moment [04:49:12] <[1997kB]> alright. ty [04:49:27] !log restarting varnish-backend on cp1087 (seems unhealthy!) [04:51:49] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:52:17] 10Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (10Marostegui) @BBlack restarted varnish on that host. It should be ok now. [04:52:32] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2743 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:52:52] PROBLEM - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2829 bytes in 1.399 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:53:00] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2822 bytes in 0.457 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:53:06] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:53:26] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2846 bytes in 0.454 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:54:02] RECOVERY - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15757 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:54:25] some problem on Commons? 
[04:54:47] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:54:50] Request from 106.213.175.182 via cp1077 cp1077, Varnish XID 532578964 [04:54:50] Error: 503, Backend fetch failed [04:54:53] 10Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (10Pruem) cp1077 is also producing this right now. [04:55:04] yannf: we are having connectivity issues, we are on it [04:55:05] There are general connectivity issues for all sites right now. [04:55:06] PROBLEM - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2821 bytes in 1.402 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:55:07] yannf: i am seeing it in meta too [04:55:15] marostegui, ok thanks [04:55:31] 10Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (10Marostegui) We are having general connectivity issues [04:56:24] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15858 bytes in 0.462 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:56:36] RECOVERY - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15848 bytes in 1.268 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:57:24] RECOVERY - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15861 bytes in 1.256 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:57:31] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15848 bytes in 0.457 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:00:26] 10Operations: HTTP 503 on zh.wikipedia.org - 
https://phabricator.wikimedia.org/T226685 (10Antigng) It seems that only requests coming through Varnish frontends at eqiad are affected. [05:01:31] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:02:01] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:02:13] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:02:13] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:02:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:02:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:02:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:02:49] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:03:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:03:55] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:07:01] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [05:07:09] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [05:07:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [05:07:55] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [05:09:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [05:09:56] RECOVERY - Eqsin HTTP 5xx 
reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [05:11:33] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 71.91 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:13:56] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.99 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:15:38] (03PS1) 10Marostegui: install_server: Do not re-image db1133 and dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/519343 [05:17:57] (03CR) 10Marostegui: [C: 03+2] install_server: Do not re-image db1133 and dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/519343 (owner: 10Marostegui) [05:26:03] 10Operations, 10ops-eqiad, 10decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (10Marostegui) [05:27:00] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [05:27:14] !log Remove db1068 from tendril and zarcillo - T226689 [05:28:48] (03PS1) 10Marostegui: mariadb: Decommission db1068 [puppet] - 10https://gerrit.wikimedia.org/r/519344 [05:29:36] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Decommission db1068 [puppet] - 10https://gerrit.wikimedia.org/r/519344 (owner: 10Marostegui) [05:31:10] (03PS2) 10Marostegui: mariadb: Decommission db1068 [puppet] - 10https://gerrit.wikimedia.org/r/519344 [05:33:35] 10Operations, 10ops-eqiad, 10decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (10Marostegui) p:05Triage→03Normal [05:40:29] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:40:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:41:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:41:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:41:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:41:01] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:41:15] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:41:43] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:41:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:41:55] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:42:13] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[05:42:51] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:43:09] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:43:41] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[05:43:51] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[05:44:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:47:39] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:47:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:47:51] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:47:51] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:48:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:48:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:48:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:48:26] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:48:26] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:48:39] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:52:01] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:52:41] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[05:52:59] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:54:01] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[05:54:01] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[05:54:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
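The "HTTP 5xx reqs/min" checks above alert on the share of recent Graphite datapoints above a fixed threshold (e.g. "55.56% of data above the critical threshold [1000.0]", recovering once less than 1% exceeds 250). A minimal sketch of that percentage calculation, using a hypothetical `percent_above` helper and made-up sample data rather than the production check_graphite code:

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above `threshold`."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)

# A window of 5xx-per-minute samples (None = missing datapoint, ignored):
series = [1200, 900, 1450, None, 1100, 300, 1800, 950, 1600]
pct = percent_above(series, 1000.0)           # 5 of 8 non-null samples -> 62.5
status = "CRITICAL" if pct >= 30.0 else "OK"  # illustrative 30% cutoff
```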
[05:59:53] (PS4) Gilles: Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661)
[06:00:12] (CR) Gilles: "Added a flag to toggle the feature on/off" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[06:00:30] (CR) jerkins-bot: [V: -1] Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[06:02:26] (PS5) Gilles: Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661)
[06:21:24] (PS6) Elukey: analytics::refinery::job::data_purge add deletion for data_quality_hourly [puppet] - https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: Mforns)
[06:27:53] (CR) Elukey: [C: +2] analytics::refinery::job::data_purge add deletion for data_quality_hourly [puppet] - https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: Mforns)
[06:30:29] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:31:01] PROBLEM - puppet last run on db2108 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:31:57] PROBLEM - puppet last run on debmonitor2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:33:47] !log restart rsyslog on wezen - T199406
[06:35:21] !log restart rsyslog on lithium - T199406
[06:35:27] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[06:35:35] both were stuck
[06:36:29] RECOVERY - puppet last run on db2108 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:38:01] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 849 days) https://wikitech.wikimedia.org/wiki/Logs
[06:43:11] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:17] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:56:10] RECOVERY - puppet last run on mc1035 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:58:10] RECOVERY - puppet last run on debmonitor2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:53] (PS1) Muehlenhoff: Mask the default uwsgi service on puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/519350
[07:03:20] (CR) jerkins-bot: [V: -1] Mask the default uwsgi service on puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/519350 (owner: Muehlenhoff)
[07:05:42] (CR) Volans: "recheck" [puppet] - https://gerrit.wikimedia.org/r/519350 (owner: Muehlenhoff)
[07:08:14] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/519350 (owner: Muehlenhoff)
[07:11:38] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:54] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:01] (CR) Marostegui: [C: +2] mariadb: Decommission db1068 [puppet] - https://gerrit.wikimedia.org/r/519344 (owner: Marostegui)
[07:24:08] (PS3) Marostegui: mariadb: Decommission db1068 [puppet] - https://gerrit.wikimedia.org/r/519344
[07:26:55] !log Stop MySQL on db1068 for decommission - T226689
[07:28:55] Operations, ops-eqiad, decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (Marostegui)
[07:29:38] Operations, ops-eqiad, DC-Ops, decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (Marostegui) a:Marostegui→RobH This host is ready for DCOPs to take over.
[07:29:56] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[07:38:11] (PS2) Muehlenhoff: Mask the default uwsgi service on puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/519350
[07:42:22] (CR) Muehlenhoff: [C: +2] Mask the default uwsgi service on puppetboard hosts [puppet] - https://gerrit.wikimedia.org/r/519350 (owner: Muehlenhoff)
[07:42:46] (PS2) Gehel: wdqs: publish full MDC in file based logs. [puppet] - https://gerrit.wikimedia.org/r/519046
[07:43:20] (CR) Gehel: [C: +2] wdqs: publish full MDC in file based logs.
[puppet] - https://gerrit.wikimedia.org/r/519046 (owner: Gehel)
[07:45:54] RECOVERY - DPKG on puppetboard2001 is OK: All packages OK
[07:47:18] RECOVERY - Check systemd state on puppetboard2001 is OK: OK - running: The system is fully operational
[07:47:38] RECOVERY - puppet last run on puppetboard2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:50:29] (PS1) Elukey: cdh::oozie: add hive2/hcat credentials classes [puppet] - https://gerrit.wikimedia.org/r/519355 (https://phabricator.wikimedia.org/T212259)
[07:50:41] thanks for fixing it moritzm!
[07:51:16] (CR) Elukey: [C: +2] cdh::oozie: add hive2/hcat credentials classes [puppet] - https://gerrit.wikimedia.org/r/519355 (https://phabricator.wikimedia.org/T212259) (owner: Elukey)
[07:51:55] (PS1) Gehel: wdqs: fix pattern in log configuration [puppet] - https://gerrit.wikimedia.org/r/519356
[07:52:15] (PS2) Gehel: wdqs: fix pattern in log configuration [puppet] - https://gerrit.wikimedia.org/r/519356
[07:52:48] Operations, Community-Relations, Traffic, Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Zache) Reported again in [[ https://fi.wikipedia.org/wiki/W...
[07:52:53] (CR) Gehel: [C: +2] wdqs: fix pattern in log configuration [puppet] - https://gerrit.wikimedia.org/r/519356 (owner: Gehel)
[07:58:24] Operations, Community-Relations, Traffic, Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (ArielGlenn) Can we get approximate times for these last rep...
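The rsyslog TLS listener recovery earlier in the log reports the certificate lifetime as "valid until 2021-10-23 19:09:29 +0000 (expires in 849 days)". That countdown is plain date arithmetic; a standalone sketch with values copied from the log line and its 06:38:01 UTC timestamp (an illustration, not the actual Icinga SSL check, which reads the certificate off the socket):

```python
from datetime import datetime, timezone

def days_until_expiry(not_after, now):
    """Whole days remaining before a certificate's notAfter timestamp."""
    return (not_after - now).days

# Values taken from the recovery message and the log timestamp (2019-06-27):
not_after = datetime(2021, 10, 23, 19, 9, 29, tzinfo=timezone.utc)
checked_at = datetime(2019, 6, 27, 6, 38, 1, tzinfo=timezone.utc)
remaining = days_until_expiry(not_after, checked_at)  # 849, matching the alert text
```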
[08:14:30] Operations, Community-Relations, Traffic, Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Ejs-80) @ArielGlenn, two fiwiki users reported about this b...
[08:14:36] just passing by to wish you guys a great day.
[08:20:30] Operations, Graphite: uwsgi-graphite-web.service not functional after reboots of Graphite hosts - https://phabricator.wikimedia.org/T226694 (MoritzMuehlenhoff)
[08:20:59] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (ema) Just bumped into another very frequent one: ` 08:18:54 ema@cp5005.eqsin.wmnet:~...
[08:21:21] (PS4) Arturo Borrero Gonzalez: toolforge: configure kubernetes node using TLS instead of token auth [puppet] - https://gerrit.wikimedia.org/r/519259 (https://phabricator.wikimedia.org/T215531) (owner: Bstorm)
[08:21:43] Operations, Performance-Team, Traffic: Send peering requests to AS with the worst TTFB - https://phabricator.wikimedia.org/T219486 (Gilles) a:Gilles→ayounsi
[08:26:18] (PS4) Gehel: icinga: fix zero division error for mjolnir bulk update alert [puppet] - https://gerrit.wikimedia.org/r/519201 (https://phabricator.wikimedia.org/T225904) (owner: Mathew.onipe)
[08:27:49] (CR) Gehel: [C: +2] icinga: fix zero division error for mjolnir bulk update alert [puppet] - https://gerrit.wikimedia.org/r/519201 (https://phabricator.wikimedia.org/T225904) (owner: Mathew.onipe)
[08:28:08] onimisionipe: ^ I'll let you check
[08:31:27] (PS5) Arturo Borrero Gonzalez: toolforge: configure kubernetes node using TLS instead of token auth [puppet] - https://gerrit.wikimedia.org/r/519259 (https://phabricator.wikimedia.org/T215531) (owner: Bstorm)
[08:33:00] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: configure kubernetes node using TLS instead of token auth [puppet] - https://gerrit.wikimedia.org/r/519259 (https://phabricator.wikimedia.org/T215531) (owner: Bstorm)
[08:44:02] Operations, Community-Relations, Traffic, Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (ArielGlenn) The 03-03:35 incidents were likely related to n...
[08:49:29] (PS1) Elukey: profile::analytics::refinery::job::data_purge: avoid saturday [puppet] - https://gerrit.wikimedia.org/r/519361 (https://phabricator.wikimedia.org/T226035)
[08:54:36] (CR) Elukey: [C: +2] profile::analytics::refinery::job::data_purge: avoid saturday [puppet] - https://gerrit.wikimedia.org/r/519361 (https://phabricator.wikimedia.org/T226035) (owner: Elukey)
[08:56:48] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (Gilles) I've seen this one stuck in poolcounter throttling for a while, it's definite...
[08:58:45] (PS7) Mvolz: Enable reftabs on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197)
[09:06:00] (PS1) Hashar: Add git buildpackage configuration [debs/file-read-backwards] (debian) - https://gerrit.wikimedia.org/r/519363
[09:08:57] (PS1) Gilles: Renew origin trial tokens [mediawiki-config] - https://gerrit.wikimedia.org/r/519364
[09:10:25] (CR) Hashar: "The CI job uses git buildpackage under the hood, seems that made the build work!" [debs/file-read-backwards] (debian) - https://gerrit.wikimedia.org/r/519363 (owner: Hashar)
[09:11:57] (CR) Gilles: [C: +2] Renew origin trial tokens [mediawiki-config] - https://gerrit.wikimedia.org/r/519364 (owner: Gilles)
[09:13:09] (Merged) jenkins-bot: Renew origin trial tokens [mediawiki-config] - https://gerrit.wikimedia.org/r/519364 (owner: Gilles)
[09:13:24] (CR) jenkins-bot: Renew origin trial tokens [mediawiki-config] - https://gerrit.wikimedia.org/r/519364 (owner: Gilles)
[09:17:27] !log rollback AMS-IX special routing
[09:19:49] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 27081 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[09:20:58] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Renew origin trial tokens (duration: 00m 59s)
[09:25:36] elastic1025 is moving a commonswiki shard away, it should recover in a few
[09:28:51] Operations, observability, Graphite: uwsgi-graphite-web.service not functional after reboots of Graphite hosts - https://phabricator.wikimedia.org/T226694 (Peachey88)
[09:32:38] (PS1) Elukey: Add more granularity to query/time|size buckets [software/druid_exporter] - https://gerrit.wikimedia.org/r/519365 (https://phabricator.wikimedia.org/T226035)
[09:34:17] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[09:35:27] Operations, observability, Goal, User-fgiunchedi: TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220104 (fgiunchedi)
[09:41:13] Operations, Wikimedia-General-or-Unknown: Request for information about hosting services for WM-ES - https://phabricator.wikimedia.org/T211414 (jcrespo) Dzhan and others answered at T211414#4822356 T211414#4805585, is that enough/does that answer your questions? I suggest if you need further information...
[09:44:45] Operations, Traffic: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (akosiaris) Open→Stalled p:Normal→Low OK, good to know. Moving to Low priority and Stalled status until then.
[09:47:11] Warning Alert for device cr2-esams.wikimedia.org - Memory over 85%
[09:47:24] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Joe)
[09:50:03] will look into ^
[09:50:41] nothing urgent, it's at 86%, slowly raising since a very long time ago
[10:02:18] Operations, Analytics, SRE-Access-Requests: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (jbond) @Nuria I have checked with moritz and cn=wmf should be all that is required for access to turnilo. @ACraze I have checked the logs on...
[10:02:27] PROBLEM - HHVM rendering on mw1320 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:02:31] PROBLEM - Nginx local proxy to apache on mw1320 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:03:55] RECOVERY - HHVM rendering on mw1320 is OK: HTTP OK: HTTP/1.1 200 OK - 74935 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:03:59] RECOVERY - Nginx local proxy to apache on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:04:35] (PS1) Revi: Add Portal Namespace to VisualEditor option on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/519367 (https://phabricator.wikimedia.org/T224813)
[10:05:32] (CR) Effie Mouzeli: [C: +2] Have the Swift rewrite proxy renew expiry headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[10:06:27] (PS6) Gilles: Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661)
[10:06:45] (CR) Effie Mouzeli: "used +2 instead of +1, sorry!" [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[10:07:28] (CR) Effie Mouzeli: [C: +2] Increase swift proxy connection timeout to 1s [puppet] - https://gerrit.wikimedia.org/r/518658 (https://phabricator.wikimedia.org/T226373) (owner: Gilles)
[10:07:37] (PS2) Effie Mouzeli: Increase swift proxy connection timeout to 1s [puppet] - https://gerrit.wikimedia.org/r/518658 (https://phabricator.wikimedia.org/T226373)
[10:11:39] (Abandoned) MarcoAurelio: DNM JENKINS TEST [debs/file-read-backwards] - https://gerrit.wikimedia.org/r/519209 (owner: MarcoAurelio)
[10:16:08] (CR) Effie Mouzeli: [C: +1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/17144/ms-fe1005.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[10:16:20] (CR) Effie Mouzeli: [C: +2] Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[10:16:30] (PS7) Effie Mouzeli: Have the Swift rewrite proxy renew expiry headers [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661)
[10:21:29] Operations, Traffic, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Qgil)
[10:21:33] <_joe_> gehel: I'm going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/518705
[10:22:33] Operations, Analytics, Fundraising-Backlog, LDAP-Access-Requests, Wikimedia-Fundraising: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (jbond) @spatton i dont see Camille de Nes on either [[https://office.wikimedia.org/wiki/Contact_list | their c...
[10:23:52] (PS1) Elukey: role::analytics_test_cluster::hadoop::ui: configure hive for hue [puppet] - https://gerrit.wikimedia.org/r/519368 (https://phabricator.wikimedia.org/T212259)
[10:23:56] (PS2) Giuseppe Lavagetto: lvs::configuration: use a meaningful request to monitor wdqs [puppet] - https://gerrit.wikimedia.org/r/518705
[10:24:22] (CR) Elukey: [C: +2] role::analytics_test_cluster::hadoop::ui: configure hive for hue [puppet] - https://gerrit.wikimedia.org/r/519368 (https://phabricator.wikimedia.org/T212259) (owner: Elukey)
[10:25:41] (CR) Giuseppe Lavagetto: [C: +2] "I thought this had already been merged, I'll take care of deploying it carefully." [puppet] - https://gerrit.wikimedia.org/r/518705 (owner: Giuseppe Lavagetto)
[10:25:46] (CR) Filippo Giunchedi: "> Patch Set 3:" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[10:25:54] <_joe_> sigh
[10:25:55] <_joe_> again
[10:26:03] Operations, Analytics, Fundraising-Backlog, LDAP-Access-Requests, Wikimedia-Fundraising: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (MoritzMuehlenhoff) @jbond: These staff pages are often slow get updated by T&C (or whoever keeps them updated),...
[10:26:05] (PS3) Giuseppe Lavagetto: lvs::configuration: use a meaningful request to monitor wdqs [puppet] - https://gerrit.wikimedia.org/r/518705
[10:28:33] <_joe_> !log progressively restarting pybal in codfw, eqiad to pick up the change in monitoring for wdqs
[10:28:34] stashbot is missing
[10:28:41] _joe_: will not be logged
[10:28:52] * _joe_ shrugs
[10:29:41] !log-not-log restarting tcpircbot-logmsgbot on icinga1001
[10:29:58] <_joe_> !log progressively restarting pybal in codfw, eqiad to pick up the change in monitoring for wdqs
[10:30:07] !log restarted tcpircbot-logmsgbot on icinga1001, was not !log-ing since 01:11 UTC this morning
[10:30:11] <_joe_> thanks volans
[10:30:14] <_joe_> if it works
[10:30:16] let's see if it works first
[10:30:19] <_joe_> it doesn't seem to
[10:31:22] :(
[10:32:22] last logged line is from freenode-connect Welcome to freenode
[10:34:46] Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (Elitre) Sorry for asking. Is this related to T226048?
[10:35:44] !log updated buster d-i image to release candidate 2
[10:36:36] Operations, Analytics, Fundraising-Backlog, LDAP-Access-Requests, Wikimedia-Fundraising: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (jbond) Open→Resolved a:jbond >>! In T226614#5288539, @MoritzMuehlenhoff wrote: > @jbond: These staf...
[10:36:59] I'm totally fried by the heat... the wikitech page redirected me, I restarted the wrong bot, on it
[10:38:49] Operations, Traffic, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Elitre) >>! In T226048#5279710, @Kri...
[10:39:46] !log restarted stashbot on toolforge was not !log-ing since 01:11 UTC this morning
[10:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:53] yay, it's back
[10:40:10] <_joe_> !log progressively restarting pybal in codfw, eqiad to pick up the change in monitoring for wdqs
[10:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:28] _joe_: thanks! On lunch break, but scream if you need me
[10:47:36] <_joe_> nah it's fine
[10:47:49] (PS1) Alexandros Kosiaris: Add data types to k8s module [puppet] - https://gerrit.wikimedia.org/r/519369
[10:47:51] (PS1) Alexandros Kosiaris: Use more specific data types in k8s module [puppet] - https://gerrit.wikimedia.org/r/519370
[10:48:02] !log updated buster d-i image to release candidate 2
[10:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:30] !log Rolling restart ms-fe* proxy services for T226373 and T211661
[10:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:37] T226373: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373
[10:48:37] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661
[10:54:23] (PS1) Elukey: role::analytics_test_cluster::hadoop::ui: add client config [puppet] - https://gerrit.wikimedia.org/r/519373 (https://phabricator.wikimedia.org/T226698)
[10:57:24] Operations, SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (jbond) @Jpita I can see your account in OID's LDAP ` uid: jpita-ctr mail: jpita-ctr@wikimedia.org ` however the jpita developer account i see is registered to a gmail address. You wil...
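The `!log` / "Logged the message at ... Server_Admin_Log" pairs above come from a bot that watches the channel for messages starting with `!log` and records them in the SAL. A toy sketch of just the matching step, with a hypothetical `extract_sal_entry` helper (the real stashbot does considerably more, including the task-title lookups seen above):

```python
import re

# Hypothetical pattern: "!log" followed by the message to record.
LOG_RE = re.compile(r"^!log\s+(?P<message>.+)$")

def extract_sal_entry(nick, line):
    """Return a SAL-style entry for a '!log' channel message, or None."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    return f"{nick}: {m.group('message')}"

entry = extract_sal_entry("volans", "!log restarted tcpircbot-logmsgbot on icinga1001")
```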
[10:57:41] Operations: HTTP 503 on zh.wikipedia.org - https://phabricator.wikimedia.org/T226685 (ema) Open→Resolved a:ema This 503 error was due to network issues in eqiad as mentioned by @Marostegui and @Antigng.
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T1100).
[11:00:04] alaa_wmde, Urbanecm, and revi: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:16] hoi~hoi
[11:01:15] :O
[11:03:59] I'll be around till next hour so if someone catches me before the slot time runs out... that's fine to me
[11:05:18] Operations, media-storage, serviceops, Patch-For-Review: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (Gilles) Error rate hasn't gone down at all, now we're just getting errors that time out at 1s instead of 0.5s... ` Jun 27 11:0...
[11:09:57] (CR) Giuseppe Lavagetto: [C: -1] "While this change seems correct, given how puppet works we need to break this down in different, progressive changes. Specifically:" [puppet] - https://gerrit.wikimedia.org/r/514226 (https://phabricator.wikimedia.org/T226675) (owner: Ppchelko)
[11:09:59] (CR) Gilles: Have the Swift rewrite proxy renew expiry headers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[11:10:43] (CR) Arturo Borrero Gonzalez: nova-compute: use puppet certs for libvirt (1 comment) [puppet] - https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: Andrew Bogott)
[11:14:50] Amir1 or Urbanecm: are you there?
[11:15:17] I'm around.
[11:15:25] I can deploy yours
[11:15:28] :)
[11:15:59] Operations, observability, Graphite: uwsgi-graphite-web.service not functional after reboots of Graphite hosts - https://phabricator.wikimedia.org/T226694 (jbond) p:Triage→Normal
[11:16:13] (CR) Ladsgroup: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/519367 (https://phabricator.wikimedia.org/T224813) (owner: Revi)
[11:17:14] (Merged) jenkins-bot: Add Portal Namespace to VisualEditor option on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/519367 (https://phabricator.wikimedia.org/T224813) (owner: Revi)
[11:17:29] (CR) jenkins-bot: Add Portal Namespace to VisualEditor option on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/519367 (https://phabricator.wikimedia.org/T224813) (owner: Revi)
[11:17:41] (CR) Ladsgroup: "We haven't finished migrating all properties on test wikidata to the new term store. This has to be done first, otherwise it'll lack value" [mediawiki-config] - https://gerrit.wikimedia.org/r/519211 (https://phabricator.wikimedia.org/T225053) (owner: Alaa Sarhan)
[11:18:19] revi: It's live on mwdebug1002
[11:18:25] {{doing}}
[11:19:16] (PS1) Gilles: Only apply expiry logic to "thumb" zone [puppet] - https://gerrit.wikimedia.org/r/519374 (https://phabricator.wikimedia.org/T211661)
[11:19:24] {{confirmed}}
[11:19:29] Urbanecm: around for your deployment?
[11:19:58] (CR) Gilles: Have the Swift rewrite proxy renew expiry headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[11:20:30] revi: going live
[11:21:18] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:519367|Add Portal Namespace to VisualEditor option on kowiki (T224813)]] (duration: 00m 57s)
[11:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:23] T224813: Enable VisualEditor for Portal Namespace on Korean Wikipedia - https://phabricator.wikimedia.org/T224813
[11:21:25] revi: ^
[11:21:29] kk
[11:22:16] Verified +2
[11:22:39] awesome Amir1 :D
[11:23:10] \o/
[11:23:18] !log EU SWAT is done
[11:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:35] Operations, SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (Jpita) @jbond https://wikitech.wikimedia.org/wiki/User:Jose_pita is that ok?
[11:28:39] Operations, SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (jbond) Open→Resolved a:jbond @Jpita thanks i have added that account to the wmf group, you should be able to login to logstash now, please re-open if you are still having prob...
[11:29:37] 10Operations, 10media-storage, 10serviceops, 10Patch-For-Review, 10User-jijiki: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10jijiki) [11:32:36] 10Operations, 10SRE-Access-Requests: Requesting access to logstash for jpita - https://phabricator.wikimedia.org/T226091 (10Jpita) @jbond it works, thanks for the help [11:34:57] (03CR) 10KartikMistry: [C: 03+1] Don't show cannot publish error to 'sysop' users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [11:35:51] (03PS1) 10Arturo Borrero Gonzalez: k8s: kubelet: replace require with a warning [puppet] - 10https://gerrit.wikimedia.org/r/519375 (https://phabricator.wikimedia.org/T215531) [11:38:11] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10jijiki) @awight I will rollout the new version to production today [11:44:31] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: node: fix template path [puppet] - 10https://gerrit.wikimedia.org/r/519376 (https://phabricator.wikimedia.org/T215531) [11:45:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: node: fix template path [puppet] - 10https://gerrit.wikimedia.org/r/519376 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [12:00:57] 10Operations, 10media-storage, 10serviceops, 10Patch-For-Review, 10User-jijiki: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10Joe) Do we have metrics on the swift backends open connections / connections queues? without such information,... 
[12:01:43] !log start of mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=testwikidatawiki --batch-size=100 --sleep=3 (T225052)
[12:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:49] T225052: Run Property Terms Rebuild script - https://phabricator.wikimedia.org/T225052
[12:01:58] (03PS1) 10Gilles: Serve JPG when WEBP conversion fails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707)
[12:04:02] 10Operations, 10media-storage, 10serviceops, 10Patch-For-Review, 10User-jijiki: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10jijiki) @Joe I will start a more thorough investigation the following days, we'll see what will come up
[12:08:39] (03CR) 10Effie Mouzeli: [C: 03+1] Serve JPG when WEBP conversion fails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707) (owner: 10Gilles)
[12:10:37] (03CR) 10Jbond: [C: 03+2] icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519240 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond)
[12:10:46] (03PS2) 10Jbond: icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519240 (https://phabricator.wikimedia.org/T226508)
[12:17:05] 10Operations, 10Traffic, 10CommRel-Specialists-Support (Apr-Jun-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Trizek-WMF) >>! In T226048#5288560,...
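The rebuildPropertyTerms.php run logged above is throttled with `--batch-size=100 --sleep=3` so the rebuild does not overwhelm replication. A minimal sketch of that batch-plus-sleep pattern, assuming a hypothetical `process_batch` callback (this is not the actual Wikibase maintenance script):

```python
import time

def rebuild_terms(entity_ids, batch_size=100, sleep=3, process_batch=None):
    """Process IDs in fixed-size batches, pausing between batches so
    replication can catch up (the --batch-size/--sleep idea)."""
    batches = 0
    for start in range(0, len(entity_ids), batch_size):
        batch = entity_ids[start:start + batch_size]
        process_batch(batch)              # hypothetical per-batch worker
        batches += 1
        if start + batch_size < len(entity_ids):
            time.sleep(sleep)             # throttle, like --sleep=3
    return batches
```

With 250 IDs and a batch size of 100, this performs three batch calls and sleeps twice.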
[12:17:24] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Joe) 05Open→03Resolved a:03Joe The immediate problem seems to be resolved given we've not see corrup...
[12:28:00] (03CR) 10Jbond: [C: 03+2] icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519234 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond)
[12:28:08] (03PS5) 10Jbond: icinga user agent: add custom user agent to icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/519234 (https://phabricator.wikimedia.org/T226508)
[12:36:34] (03PS3) 10Jbond: missing data: audited using audit_hiera.py script [labs/private] - 10https://gerrit.wikimedia.org/r/519051
[12:37:11] (03PS2) 10Arturo Borrero Gonzalez: k8s: kubelet: stop requiring ::k8s::infrastructure_config [puppet] - 10https://gerrit.wikimedia.org/r/519375 (https://phabricator.wikimedia.org/T215531)
[12:40:23] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Trizek-WMF) @Marostegui, which wikis are affected? Only English Wikipedia? Do you need to display a banner too?
[12:41:11] (03PS11) 10Jbond: icinga: Add a script to parse and query the status.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459
[12:41:28] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Marostegui) >>! In T226358#5288891, @Trizek-WMF wrote: > @Marostegui, which wikis are affected? Only English Wikipedia? > Do you nee...
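The "icinga user agent" changes above give monitoring checks a custom User-Agent instead of a default library one, so their traffic is identifiable in server logs. A sketch of the idea using the standard library; the UA string here is illustrative, not the one in the gerrit change:

```python
import urllib.request

# Illustrative identifier; real monitoring should name the tool and a contact.
UA = "wmf-icinga-check/1.0 (ops-contact@example.org)"

def make_request(url):
    """Build a request that identifies the monitoring client rather than
    sending the default urllib User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": UA})
```

Server-side, this lets operators distinguish (and, if needed, rate-limit or allowlist) check traffic.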
[12:42:11] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Memory over 85%
[12:42:14] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Trizek-WMF) Thank you! :)
[12:42:22] (03CR) 10Jbond: icinga: Add a script to parse and query the status.dat file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond)
[12:47:24] 10Operations, 10Analytics, 10Fundraising-Backlog, 10LDAP-Access-Requests, 10Wikimedia-Fundraising: Turnilo access for Camille de Nes (Advancement) - https://phabricator.wikimedia.org/T226614 (10spatton) Thanks @jbond and @MoritzMuehlenhoff!
[12:47:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, specially because we are about to get rid of Jessie. But I wonder if we will have to undo this factorization again when we reach Bus" [puppet] - 10https://gerrit.wikimedia.org/r/519268 (owner: 10Andrew Bogott)
[12:56:06] (03CR) 10Volans: [C: 03+1] "LGTM, last call for an external reviewer (given both John and me wrote this)" [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond)
[13:00:22] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::hadoop::ui: add client config [puppet] - 10https://gerrit.wikimedia.org/r/519373 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[13:00:29] (03PS2) 10Elukey: role::analytics_test_cluster::hadoop::ui: add client config [puppet] - 10https://gerrit.wikimedia.org/r/519373 (https://phabricator.wikimedia.org/T226698)
[13:02:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/519374 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles)
[13:04:51] (03CR) 10Filippo Giunchedi: Have the Swift rewrite proxy renew expiry headers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/518226 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles)
[13:06:53] (03PS1) 10Elukey: Move analytics client profile from hue to druid [puppet] - 10https://gerrit.wikimedia.org/r/519390 (https://phabricator.wikimedia.org/T226698)
[13:07:21] (03CR) 10Filippo Giunchedi: "LGTM, nonblocking nit inline, feel free to ignore" (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707) (owner: 10Gilles)
[13:07:53] (03CR) 10Elukey: [C: 03+2] Move analytics client profile from hue to druid [puppet] - 10https://gerrit.wikimedia.org/r/519390 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[13:11:48] !log depool restbase10(0[7-9]|1[0-5]) before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/513262
[13:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:42] (03PS1) 10Elukey: role::druid::test_analytics::worker: add hive config [puppet] - 10https://gerrit.wikimedia.org/r/519393 (https://phabricator.wikimedia.org/T226698)
[13:13:07] (03PS3) 10Filippo Giunchedi: RESTBase: Remove restbase10(0[7-9]|1[0-5]) and set them as spares [puppet] - 10https://gerrit.wikimedia.org/r/513262 (https://phabricator.wikimedia.org/T223976) (owner: 10Mobrovac)
[13:14:21] (03CR) 10Filippo Giunchedi: [C: 03+2] RESTBase: Remove restbase10(0[7-9]|1[0-5]) and set them as spares [puppet] - 10https://gerrit.wikimedia.org/r/513262 (https://phabricator.wikimedia.org/T223976) (owner: 10Mobrovac)
[13:15:18] !log start druid drop datasource test - might affect AQS - T226035
[13:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:23] T226035: Dropping data from druid takes down aqs hosts - https://phabricator.wikimedia.org/T226035
[13:21:52] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: add hive config [puppet] - 10https://gerrit.wikimedia.org/r/519393 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[13:22:01] (03PS2) 10Elukey: role::druid::test_analytics::worker: add hive config [puppet] - 10https://gerrit.wikimedia.org/r/519393 (https://phabricator.wikimedia.org/T226698)
[13:24:15] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10fgiunchedi)
[13:26:33] (03PS4) 10Lucas Werkmeister (WMDE): dologmsg: add manpage [puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244)
[13:26:44] !log push RPKI classification test to cr4-ulsfo - T220669
[13:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:49] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669
[13:28:20] (03PS1) 10Elukey: role::druid::test_analytics::worker: add other hadoop client config [puppet] - 10https://gerrit.wikimedia.org/r/519395 (https://phabricator.wikimedia.org/T226698)
[13:28:48] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: add other hadoop client config [puppet] - 10https://gerrit.wikimedia.org/r/519395 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[13:29:51] (03CR) 10Lucas Werkmeister (WMDE): "I tried to rebase it, but someone should definitely test it with the puppet compiler, I’m not sure if the paths are still correct. (I trie" [puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE))
[13:30:47] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Trizek-WMF) Banner set. It will be displayed starting at 05:00 UTC July 3 on all wikis. End at 06:20 UTC.
[13:31:30] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Marostegui) Thank you!
[13:34:06] PROBLEM - Restbase root url on restbase1010 is CRITICAL: connect to address 10.64.0.112 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:34:18] PROBLEM - Restbase root url on restbase1011 is CRITICAL: connect to address 10.64.0.113 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:34:22] PROBLEM - Restbase root url on restbase1012 is CRITICAL: connect to address 10.64.32.79 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:34:34] PROBLEM - Restbase root url on restbase1014 is CRITICAL: connect to address 10.64.48.133 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:34:40] PROBLEM - Restbase root url on restbase1013 is CRITICAL: connect to address 10.64.32.80 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:34:40] PROBLEM - Restbase root url on restbase1007 is CRITICAL: connect to address 10.64.0.223 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:34:48] PROBLEM - Restbase root url on restbase1009 is CRITICAL: connect to address 10.64.48.110 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:35:00] PROBLEM - Restbase root url on restbase1008 is CRITICAL: connect to address 10.64.32.178 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:35:00] PROBLEM - Restbase root url on restbase1015 is CRITICAL: connect to address 10.64.48.134 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[13:35:07] mobrovac: --^
[13:35:25] anything ongoing at the moment?
[13:35:32] known elukey, these are being decommed
[13:35:42] fiuuuu
[13:35:44] thanks :)
[13:35:46] :)
[13:35:48] i'll ack
[13:36:40] (03PS2) 10Alexandros Kosiaris: Add data types to k8s module [puppet] - 10https://gerrit.wikimedia.org/r/519369
[13:36:42] (03PS2) 10Alexandros Kosiaris: Use more specific data types in k8s module [puppet] - 10https://gerrit.wikimedia.org/r/519370
[13:36:44] (03PS1) 10Alexandros Kosiaris: kubernetes: Move k8s::infrastructure_config to profile [puppet] - 10https://gerrit.wikimedia.org/r/519398
[13:38:37] uugh thanks, I guess I was too eager
[13:38:42] running puppet on icinga
[13:39:02] the decom cookbook does that, was it not run?
[13:39:18] nobody likes cookbooks!
[13:39:21] * elukey runs away
[13:40:27] volans: which one? I followed https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Remove_from_production
[13:40:58] ah is still in the phase of getting out of prod, not decom yet
[13:41:41] (03PS1) 10Elukey: role::druid::test_analytics::worker: add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/519400 (https://phabricator.wikimedia.org/T226698)
[13:41:48] it's mentioned later on
[13:42:04] and we have a bunch of improvements coming up as a follow up of a session at the sre summit
[13:42:23] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/519400 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[13:43:05] !log push RPKI classification test to cr3-ulsfo - T220669
[13:43:06] that's awesome
[13:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:10] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669
[13:43:41] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10mobrovac)
[13:44:24] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10mobrovac) 05Open→03Resolved a:03mobrovac
[13:46:28] (03PS1) 10Ema: cache: double appservers connection limit [puppet] - 10https://gerrit.wikimedia.org/r/519401
[13:51:11] (03PS2) 10Ema: cache: double appservers and api connection limit [puppet] - 10https://gerrit.wikimedia.org/r/519401
[13:51:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17146/ says noop in production and is effectively a noop, some merging" [puppet] - 10https://gerrit.wikimedia.org/r/519369 (owner: 10Alexandros Kosiaris)
[13:52:10] (03PS3) 10Alexandros Kosiaris: Add data types to k8s module [puppet] - 10https://gerrit.wikimedia.org/r/519369
[13:52:12] (03PS3) 10Alexandros Kosiaris: Use more specific data types in k8s module [puppet] - 10https://gerrit.wikimedia.org/r/519370
[13:52:14] (03PS2) 10Alexandros Kosiaris: kubernetes: Move k8s::infrastructure_config to profile [puppet] - 10https://gerrit.wikimedia.org/r/519398
[13:55:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17146/ says noop in production and is effectively a noop, merging" [puppet] - 10https://gerrit.wikimedia.org/r/519370 (owner: 10Alexandros Kosiaris)
[14:05:56] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) All our San Francisco POP now have a `validation-state` on its received prefixes. Next step is to push it to all the sites.
[14:07:02] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond)
[14:07:17] (03PS1) 10Ema: cache: reimage cp2002 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519404 (https://phabricator.wikimedia.org/T226637)
[14:08:48] (03PS1) 10Elukey: role::druid::test_analytics::worker: remove hadoop client config [puppet] - 10https://gerrit.wikimedia.org/r/519405 (https://phabricator.wikimedia.org/T226698)
[14:09:40] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: remove hadoop client config [puppet] - 10https://gerrit.wikimedia.org/r/519405 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[14:10:08] (03CR) 10Andrew Bogott: [C: 03+2] "As I was writing this I thought, "Hope we don't have to revert all this for Buster!" Time will tell :/" [puppet] - 10https://gerrit.wikimedia.org/r/519268 (owner: 10Andrew Bogott)
[14:10:16] (03PS5) 10Andrew Bogott: nova-compute: consolidate a bunch of code that isn't distro-specific [puppet] - 10https://gerrit.wikimedia.org/r/519268
[14:10:55] (03CR) 10Jbond: [C: 03+2] icinga: Add a script to parse and query the status.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond)
[14:11:03] (03PS12) 10Jbond: icinga: Add a script to parse and query the status.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459
[14:11:32] !log depool cp2002 and reimage as upload_ats T226637
[14:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:37] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[14:12:03] (03PS6) 10Andrew Bogott: nova-compute: consolidate a bunch of code that isn't distro-specific [puppet] - 10https://gerrit.wikimedia.org/r/519268
[14:13:01] (03PS2) 10Ema: cache: reimage cp2002 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519404 (https://phabricator.wikimedia.org/T226637)
[14:13:52] !log push RPKI classification test to eqord - T220669
[14:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:57] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669
[14:13:58] (03CR) 10Ema: [C: 03+2] cache: reimage cp2002 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519404 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema)
[14:16:18] (03PS7) 10Andrew Bogott: nova-compute: consolidate a bunch of code that isn't distro-specific [puppet] - 10https://gerrit.wikimedia.org/r/519268
[14:16:19] (03PS3) 10Andrew Bogott: nova-compute: remove libvirt_type params [puppet] - 10https://gerrit.wikimedia.org/r/519275
[14:16:22] (03PS13) 10Andrew Bogott: nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484)
[14:16:23] (03PS2) 10Andrew Bogott: libvirtd: turn on listening [puppet] - 10https://gerrit.wikimedia.org/r/519315
[14:17:47] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2002.codfw.wmnet'] ` The log can be found in `...
[14:19:27] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute: remove libvirt_type params [puppet] - 10https://gerrit.wikimedia.org/r/519275 (owner: 10Andrew Bogott)
[14:23:58] !log running `mwscript extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --wiki=commonswiki --audio --missing --throttle` in screen as me on mwmaint1002 T226713
[14:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:03] T226713: Run cleanupTranscodes.php for current midi files - https://phabricator.wikimedia.org/T226713
[14:24:47] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/compiler1002/17148/ says noop" [puppet] - 10https://gerrit.wikimedia.org/r/519398 (owner: 10Alexandros Kosiaris)
[14:28:17] !log push RPKI classification to Dallas - T220669
[14:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:22] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669
[14:28:44] (03PS1) 10Elukey: role::druid::test_analytics::worker: enable kerberos [puppet] - 10https://gerrit.wikimedia.org/r/519408 (https://phabricator.wikimedia.org/T226698)
[14:28:49] 10Operations, 10media-storage, 10serviceops, 10Patch-For-Review, 10User-jijiki: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10fgiunchedi) >>! In T226373#5288615, @Gilles wrote: > Error rate hasn't gone down at all, now we're just gettin...
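The `restbase10(0[7-9]|1[0-5])` expression used in the depool and decommission messages above is a regular expression covering nine hosts. A quick expansion showing exactly which hostnames it matches:

```python
import re

# The host pattern from the depool/decommission log messages above.
pattern = re.compile(r"restbase10(0[7-9]|1[0-5])$")

# Check it against every possible restbase10NN name.
candidates = [f"restbase10{i:02d}" for i in range(100)]
matched = [host for host in candidates if pattern.match(host)]
# matched covers restbase1007 through restbase1015 inclusive.
```

This is the same shorthand Cumin-style tooling expands when targeting host ranges.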
[14:28:59] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute: use puppet certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/519276 (https://phabricator.wikimedia.org/T225484) (owner: 10Andrew Bogott)
[14:29:24] (03PS2) 10Elukey: role::druid::test_analytics::worker: enable kerberos [puppet] - 10https://gerrit.wikimedia.org/r/519408 (https://phabricator.wikimedia.org/T226698)
[14:30:10] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: enable kerberos [puppet] - 10https://gerrit.wikimedia.org/r/519408 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[14:30:53] (03CR) 10Herron: "akosiaris got time for one more calico update?" [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron)
[14:32:34] (03PS1) 10Cwhite: grafana: remove legacy varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/519410 (https://phabricator.wikimedia.org/T184942)
[14:32:58] (03CR) 10Alexandros Kosiaris: "> akosiaris got time for one more calico update?" [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron)
[14:33:17] !log push newer calico outgoing policy rules. T225005
[14:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:21] T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005
[14:33:31] thank you!
[14:33:40] (03PS3) 10Andrew Bogott: libvirtd: turn on listening [puppet] - 10https://gerrit.wikimedia.org/r/519315
[14:33:42] (03PS1) 10Andrew Bogott: nova-compute: fix dependency for cacert.pem [puppet] - 10https://gerrit.wikimedia.org/r/519411 (https://phabricator.wikimedia.org/T225484)
[14:33:46] yw
[14:33:55] PROBLEM - puppet last run on cloudvirt1025 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
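The status.dat file targeted by the "icinga: Add a script to parse and query the status.dat file" change (gerrit 514459) is a series of `type { key=value ... }` blocks. A minimal sketch of parsing that shape, under the assumption of well-formed blocks; this is not the reviewed script itself:

```python
import re

def parse_status_dat(text):
    """Parse Nagios/Icinga status.dat-style blocks ('type { key=value ... }')
    into a list of (block_type, fields) pairs. Simplified sketch only."""
    entries = []
    for block in re.finditer(r"(\w+)\s*\{(.*?)\}", text, re.S):
        kind, body = block.group(1), block.group(2)
        fields = {}
        for line in body.splitlines():
            line = line.strip()
            if "=" in line:
                key, _, value = line.partition("=")
                fields[key] = value
        entries.append((kind, fields))
    return entries
```

Querying then reduces to filtering the parsed entries, e.g. all `servicestatus` blocks with `current_state` of 2 (CRITICAL).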
[14:34:24] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute: fix dependency for cacert.pem [puppet] - 10https://gerrit.wikimedia.org/r/519411 (https://phabricator.wikimedia.org/T225484) (owner: 10Andrew Bogott)
[14:35:35] PROBLEM - puppet last run on cloudvirt1019 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[14:36:32] (03PS4) 10Andrew Bogott: libvirtd: turn on listening [puppet] - 10https://gerrit.wikimedia.org/r/519315
[14:36:34] (03PS1) 10Andrew Bogott: nova-compute: fix dependency for cacert.pem [puppet] - 10https://gerrit.wikimedia.org/r/519412 (https://phabricator.wikimedia.org/T225484)
[14:37:24] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute: fix dependency for cacert.pem [puppet] - 10https://gerrit.wikimedia.org/r/519412 (https://phabricator.wikimedia.org/T225484) (owner: 10Andrew Bogott)
[14:37:53] (03PS1) 10Ema: cache_upload codfw: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/519414 (https://phabricator.wikimedia.org/T226637)
[14:38:25] PROBLEM - puppet last run on cloudvirt1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[14:39:19] RECOVERY - puppet last run on cloudvirt1025 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[14:40:39] PROBLEM - puppet last run on cloudvirt1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[14:40:57] PROBLEM - puppet last run on cloudvirtan1005 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[14:42:11] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[14:43:00] (03PS2) 10Herron: kafka-main2001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/519021
[14:43:20] !log beginning replacement of kafka2001 with kafka-main2001 T225005
[14:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:26] T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005
[14:46:00] (03CR) 10Herron: [C: 03+2] kafka-main2001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/519021 (owner: 10Herron)
[14:46:03] PROBLEM - puppet last run on cloudvirt1008 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[14:48:18] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2002.codfw.wmnet'] ` The log can be found in `...
[14:53:33] (03PS1) 10Alexandros Kosiaris: k8s: Make $extra_params type an array of strings [puppet] - 10https://gerrit.wikimedia.org/r/519418
[14:55:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s: Make $extra_params type an array of strings [puppet] - 10https://gerrit.wikimedia.org/r/519418 (owner: 10Alexandros Kosiaris)
[15:03:48] RECOVERY - puppet last run on cloudvirt1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[15:05:13] (03PS1) 10Ema: Revert "install_server: Do not re-image db1133 and dbproxy" [puppet] - 10https://gerrit.wikimedia.org/r/519419
[15:05:22] (03PS1) 10Filippo Giunchedi: install_server: use buster for centrallog1001 [puppet] - 10https://gerrit.wikimedia.org/r/519420 (https://phabricator.wikimedia.org/T200706)
[15:05:26] (03PS1) 10Alexandros Kosiaris: Revert "install_server: Do not re-image db1133 and dbproxy" [puppet] - 10https://gerrit.wikimedia.org/r/519421
[15:05:57] (03PS2) 10Alexandros Kosiaris: Revert "install_server: Do not re-image db1133 and dbproxy" [puppet] - 10https://gerrit.wikimedia.org/r/519421
[15:06:20] RECOVERY - puppet last run on cloudvirt1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:06:28] 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Development services), and 3 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10MoritzMuehlenhoff) @mmodell I've built an interim package on our package...
[15:06:40] (03PS1) 10Elukey: druid: ensure that the druid user is in the pupept catalog [puppet] - 10https://gerrit.wikimedia.org/r/519422 (https://phabricator.wikimedia.org/T226698)
[15:06:40] RECOVERY - puppet last run on cloudvirtan1005 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[15:06:40] RECOVERY - puppet last run on cloudvirt1019 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:06:50] (03CR) 10Urbanecm: [C: 03+2] "Per Trizek's request, beta only, so it can go out at any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) (owner: 10Kosta Harlan)
[15:07:05] (03PS3) 10Urbanecm: Betalabs: Enable GrowthExperiments features for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) (owner: 10Kosta Harlan)
[15:07:08] (03PS2) 10Herron: kafka-main: replace kafka2001 hardware with kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005)
[15:07:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "install_server: Do not re-image db1133 and dbproxy" [puppet] - 10https://gerrit.wikimedia.org/r/519421 (owner: 10Alexandros Kosiaris)
[15:07:18] (03CR) 10Urbanecm: [C: 03+2] "Per Trizek's request, beta only, so it can go out at any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) (owner: 10Kosta Harlan)
[15:08:15] (03Merged) 10jenkins-bot: Betalabs: Enable GrowthExperiments features for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) (owner: 10Kosta Harlan)
[15:08:17] (03CR) 10Ottomata: [C: 03+1] druid: ensure that the druid user is in the pupept catalog [puppet] - 10https://gerrit.wikimedia.org/r/519422 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[15:08:30] (03CR) 10jenkins-bot: Betalabs: Enable GrowthExperiments features for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518109 (https://phabricator.wikimedia.org/T226205) (owner: 10Kosta Harlan)
[15:08:42] (03CR) 10Herron: [C: 03+2] kafka-main: replace kafka2001 hardware with kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron)
[15:08:49] (03PS3) 10Herron: kafka-main: replace kafka2001 hardware with kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005)
[15:09:05] (03PS2) 10Elukey: druid: ensure that the druid user is in the pupept catalog [puppet] - 10https://gerrit.wikimedia.org/r/519422 (https://phabricator.wikimedia.org/T226698)
[15:09:11] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T223825 (10jijiki)
[15:09:15] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10jijiki)
[15:09:36] (03CR) 10Elukey: [C: 03+2] druid: ensure that the druid user is in the pupept catalog [puppet] - 10https://gerrit.wikimedia.org/r/519422 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[15:10:33] 10Operations, 10Traffic, 10CommRel-Specialists-Support (Apr-Jun-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Elitre) To clarify, Trizek isn't ask...
[15:11:23] (03PS4) 10Herron: kafka-main: replace kafka2001 hardware with kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/519273 (https://phabricator.wikimedia.org/T225005)
[15:11:55] (03PS5) 10Andrew Bogott: libvirtd: turn on listening [puppet] - 10https://gerrit.wikimedia.org/r/519315
[15:11:58] RECOVERY - puppet last run on cloudvirt1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:12:27] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2002.codfw.wmnet'] ` The log can be found in `...
[15:12:54] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: use buster for centrallog1001 [puppet] - 10https://gerrit.wikimedia.org/r/519420 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi)
[15:13:02] (03PS2) 10Filippo Giunchedi: install_server: use buster for centrallog1001 [puppet] - 10https://gerrit.wikimedia.org/r/519420 (https://phabricator.wikimedia.org/T200706)
[15:13:12] (03PS6) 10Andrew Bogott: libvirtd: turn on listening [puppet] - 10https://gerrit.wikimedia.org/r/519315
[15:14:01] (03CR) 10Andrew Bogott: [C: 03+2] libvirtd: turn on listening [puppet] - 10https://gerrit.wikimedia.org/r/519315 (owner: 10Andrew Bogott)
[15:16:14] (03PS3) 10Filippo Giunchedi: install_server: use buster for centrallog1001 [puppet] - 10https://gerrit.wikimedia.org/r/519420 (https://phabricator.wikimedia.org/T200706)
[15:24:20] (03PS1) 10ArielGlenn: tiny bit more error reporting when we do stubs/pagelogs/abstracts [dumps] - 10https://gerrit.wikimedia.org/r/519427 (https://phabricator.wikimedia.org/T226659)
[15:26:06] 10Operations, 10Space, 10Wikimedia-Mailing-lists: Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Qgil)
[15:26:45] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Qgil)
[15:34:09] (03PS1) 10Muehlenhoff: Record extended MOU for nathante [puppet] - 10https://gerrit.wikimedia.org/r/519429
[15:37:10] (03PS1) 10Andrew Bogott: libvirt: remove some source files that aren't actually installed [puppet] - 10https://gerrit.wikimedia.org/r/519431
[15:37:37] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (10Qgil) Sorry, it took more time than I expected when I posted the comment above, but here it is: {T226727} & https://discuss-space.wmflabs.org/t/inte...
[15:39:39] (03PS2) 10Andrew Bogott: libvirt: remove some source files that aren't actually installed [puppet] - 10https://gerrit.wikimedia.org/r/519431
[15:40:39] (03CR) 10Andrew Bogott: [C: 03+2] libvirt: remove some source files that aren't actually installed [puppet] - 10https://gerrit.wikimedia.org/r/519431 (owner: 10Andrew Bogott)
[15:40:50] (03PS1) 10BPirkle: Add kask session storage configuration. Use only on testwiki, [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519432 (https://phabricator.wikimedia.org/T222099)
[15:41:36] (03PS2) 10Muehlenhoff: Record extended MOU for nathante [puppet] - 10https://gerrit.wikimedia.org/r/519429
[15:41:43] (03CR) 10jerkins-bot: [V: 04-1] Add kask session storage configuration. Use only on testwiki, [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519432 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle)
[15:43:08] (03CR) 10Muehlenhoff: [C: 03+2] Record extended MOU for nathante [puppet] - 10https://gerrit.wikimedia.org/r/519429 (owner: 10Muehlenhoff)
[15:43:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[15:43:26] (03PS2) 10BPirkle: Add kask session storage configuration. Use only on testwiki, [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519432 (https://phabricator.wikimedia.org/T222099)
[15:44:13] (03CR) 10jerkins-bot: [V: 04-1] Add kask session storage configuration. Use only on testwiki, [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519432 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle)
[15:44:30] (03PS1) 10Jbond: mailman: rename mailing list [puppet] - 10https://gerrit.wikimedia.org/r/519434 (https://phabricator.wikimedia.org/T218367)
[15:45:17] (03CR) 10ArielGlenn: [C: 03+2] tiny bit more error reporting when we do stubs/pagelogs/abstracts [dumps] - 10https://gerrit.wikimedia.org/r/519427 (https://phabricator.wikimedia.org/T226659) (owner: 10ArielGlenn)
[15:46:34] (03CR) 10Herron: [C: 03+1] "lgtm, one nitpick" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519434 (https://phabricator.wikimedia.org/T218367) (owner: 10Jbond)
[15:48:11] (03PS2) 10Jbond: mailman: rename mailing list [puppet] - 10https://gerrit.wikimedia.org/r/519434 (https://phabricator.wikimedia.org/T218367)
[15:48:26] (03CR) 10Jbond: mailman: rename mailing list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519434 (https://phabricator.wikimedia.org/T218367) (owner: 10Jbond)
[15:48:34] (03PS3) 10Jbond: mailman: rename mailing list [puppet] - 10https://gerrit.wikimedia.org/r/519434 (https://phabricator.wikimedia.org/T218367)
[15:49:33] (03CR) 10Jbond: [C: 03+2] mailman: rename mailing list [puppet] - 10https://gerrit.wikimedia.org/r/519434 (https://phabricator.wikimedia.org/T218367) (owner: 10Jbond)
[15:50:16] (03PS1) 10CDanis: phaste: set a User-Agent in line with WMF policy [puppet] - 10https://gerrit.wikimedia.org/r/519435
[15:50:18] (03PS1) 10CDanis: phaste: add argument parsing and --title [puppet] - 10https://gerrit.wikimedia.org/r/519436
[15:50:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[15:53:50] (03PS2) 10Jbond: phaste: set a User-Agent in line with WMF policy [puppet] - 10https://gerrit.wikimedia.org/r/519435 (https://phabricator.wikimedia.org/T226508) (owner: 10CDanis)
[15:54:04] (03CR) 10Jbond: "LGTM - added Bug: T226508" [puppet] - 10https://gerrit.wikimedia.org/r/519435 (https://phabricator.wikimedia.org/T226508) (owner: 10CDanis)
[15:54:36] (03PS3) 10CDanis: phaste: set a User-Agent in line with WMF policy [puppet] - 10https://gerrit.wikimedia.org/r/519435 (https://phabricator.wikimedia.org/T226508)
[15:54:48] (03CR) 10CDanis: [C: 03+2] phaste: set a User-Agent in line with WMF policy [puppet] - 10https://gerrit.wikimedia.org/r/519435 (https://phabricator.wikimedia.org/T226508) (owner: 10CDanis)
[15:56:21] (03CR) 10Volans: phaste: set a User-Agent in line with WMF policy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519435 (https://phabricator.wikimedia.org/T226508) (owner: 10CDanis)
[15:57:42] (03PS2) 10CDanis: phaste: add argument parsing and --title [puppet] - 10https://gerrit.wikimedia.org/r/519436
[16:00:02] (03PS1) 10Elukey: role::druid::test_analytics::worker: fix some kerberos parameters [puppet] - 10https://gerrit.wikimedia.org/r/519439
(https://phabricator.wikimedia.org/T226698) [16:00:05] godog and _joe_: #bothumor I ❤ Unicode. All rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:01:07] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/519436 (owner: 10CDanis) [16:01:34] (03CR) 10CDanis: [C: 03+2] phaste: add argument parsing and --title [puppet] - 10https://gerrit.wikimedia.org/r/519436 (owner: 10CDanis) [16:01:40] thanks jbond42 ! [16:01:53] no problem :) [16:02:22] (03PS2) 10Elukey: role::druid::test_analytics::worker: fix some kerberos parameters [puppet] - 10https://gerrit.wikimedia.org/r/519439 (https://phabricator.wikimedia.org/T226698) [16:04:15] thanks for allowing me to avoid publicly +1 that script :-P [16:04:24] (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: fix some kerberos parameters [puppet] - 10https://gerrit.wikimedia.org/r/519439 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [16:04:44] lol [16:05:23] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10jbond) @Varnent I have now renamed the old list and it is now available at https://lists.wikimedia.org/mailman/listinfo/MoveCom. The old U...
[16:05:50] <[1997kB]> SSE stream, Error: broker transport failure Code: -195 errno: -195 [16:06:08] * cdanis doing the dirty work so volans doesn't have to <3 [16:06:36] <3 [16:08:44] (03PS1) 10Alexandros Kosiaris: ganeti partman: Switch to 100% of lvm guided size [puppet] - 10https://gerrit.wikimedia.org/r/519441 (https://phabricator.wikimedia.org/T224603) [16:10:12] (03Abandoned) 10Ema: Revert "install_server: Do not re-image db1133 and dbproxy" [puppet] - 10https://gerrit.wikimedia.org/r/519419 (owner: 10Ema) [16:10:31] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:12:26] (03PS1) 10Jcrespo: Revert "dbproxy: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/519442 [16:13:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti partman: Switch to 100% of lvm guided size [puppet] - 10https://gerrit.wikimedia.org/r/519441 (https://phabricator.wikimedia.org/T224603) (owner: 10Alexandros Kosiaris) [16:16:42] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2002.codfw.wmnet'] ` and were **ALL** successful. 
[16:18:55] (03PS2) 10Ema: cache_upload codfw: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/519414 (https://phabricator.wikimedia.org/T226637) [16:22:22] (03CR) 10Ema: [C: 03+2] cache_upload codfw: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/519414 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [16:22:45] 10Operations, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Gerrit admin permissions for Ottomata - https://phabricator.wikimedia.org/T226724 (10Jdforrester-WMF) [16:24:26] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T226569 (10Cmjohnson) 05Open→03Resolved @Marostegui disk swapped but this server is out of warranty. I would suggest moving masters to new servers. [16:24:43] (03PS2) 10Jcrespo: Revert "dbproxy: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/519442 [16:29:33] 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Cmjohnson) @Eevans Do you still want to move this server? Let's coordinate a day/time [16:31:50] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [16:36:56] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T226569 (10jcrespo) That's the plan. See: ` root@db1072:~$ megacli -PDList -aALL | grep rro Media Error Count: 0 Other Error Count: 0 Media Error Count: 0 Other Error Count: 3 Media Error Count: 0 Other Error Coun... 
[16:37:04] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti2009.codfw.wmnet', 'ganeti2010.codfw.wmnet', 'ganeti... [16:38:04] (03CR) 10Jcrespo: [C: 03+2] Revert "dbproxy: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/519442 (owner: 10Jcrespo) [16:39:11] !log pool cp2002 w/ ATS backend T226637 [16:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:17] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [16:41:08] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: add stretch-wikimedia/component/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519449 (https://phabricator.wikimedia.org/T215975) [16:42:44] !log repool labsdb1011 T222978 [16:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:48] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 [16:43:48] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: add stretch-wikimedia/component/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519449 (https://phabricator.wikimedia.org/T215975) [16:45:39] (03PS3) 10Arturo Borrero Gonzalez: aptrepo: add stretch-wikimedia/thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519449 (https://phabricator.wikimedia.org/T215975) [16:46:29] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10ACraze) Ahh ok, I'm able to get in to turnilo now, thanks! 
[16:53:50] (03CR) 10Cwhite: [C: 03+2] initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [16:55:10] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10akosiaris) @WMDE-leszek, @Tarrow. Any feedback on the comment above? [16:58:37] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Tarrow) @akosiaris Yep; we've interpreted it as something we really need before exposing it to real traffic. We've got a ticket open about... [17:00:04] cscott, arlolra, subbu, and halfak: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T1700). [17:04:20] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:06:49] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10akosiaris) >>! In T212189#5289724, @Tarrow wrote: > @akosiaris Yep; we've interpreted it as something we really need before exposing it to... [17:08:12] (03PS3) 10BPirkle: Add kask session storage configuration. 
Use only on testwiki, [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519432 (https://phabricator.wikimedia.org/T222099) [17:09:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/519449 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [17:13:19] (03PS1) 10Volans: dbconfig: structure return values of actions [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 [17:13:21] (03PS1) 10Volans: dbconfig: config diff, on non-empty diff exit 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/519458 [17:13:23] (03PS1) 10Volans: configuration: change IRC default values [software/conftool] - 10https://gerrit.wikimedia.org/r/519459 [17:13:25] (03PS1) 10Volans: dbconfig: allow to remote paste messages [software/conftool] - 10https://gerrit.wikimedia.org/r/519460 [17:13:27] (03PS1) 10Volans: dbconfig: improve config commit and restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519461 [17:17:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: add stretch-wikimedia/thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519449 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [17:21:04] !log imported gpg keys 9DC858229FC7DD38854AE2D88D81803C0EBFCD88 and 54A647F9048D5688D7DA2ABE6A030B21BA07F4FB into install1002 for T215975 [17:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:10] T215975: Package/copy kubeadm, kubelet, docker-ce and kubectl to Toolforge Aptly or Reprepro - https://phabricator.wikimedia.org/T215975 [17:23:35] !log ppchelko@deploy1001 Started deploy [restbase/deploy@da50001]: Use new projects and new config layout T220855, rb2009 only [17:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:40] T220855: Split the RESTBase execution paths - https://phabricator.wikimedia.org/T220855 [17:25:52] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: fix wrong 
component name for thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519463 (https://phabricator.wikimedia.org/T215975) [17:26:13] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@da50001]: Use new projects and new config layout T220855, rb2009 only (duration: 02m 38s) [17:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:29] 10Operations, 10ops-eqiad, 10Analytics, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) [17:27:30] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:27:55] known ^ [17:28:16] RECOVERY - MegaRAID on db1072 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:28:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: fix wrong component name for thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519463 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [17:31:38] 10Operations, 10SRE-Access-Requests: please re-activate LDAP access for Dzahn - https://phabricator.wikimedia.org/T226744 (10Mutante) [17:34:42] restbase alert is known, the node's depooled [17:35:56] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10jbond) 05Open→03Resolved a:03jbond great, i think this is done now so closing please re open if there is still an issue [17:37:28] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10ops-monitoring-bot) Completed auto-reimage of 
hosts: ` ['ganeti2013.codfw.wmnet', 'ganeti2014.codfw.wmnet', 'ganeti2012.codfw.wmnet', 'ganeti2010.codfw.wmnet', 'ganeti2009.codfw.wmnet', 'gan... [17:42:18] (03CR) 10Urbanecm: [C: 04-1] "This depends on I8f18baa4b34318ac14e8c8e362ea59a1283c52c4, which is merged, but not in wmf.11. Deploying this before July 11 will make thi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [17:45:10] 10Operations, 10SRE-Access-Requests: please re-activate LDAP access for Dzahn - https://phabricator.wikimedia.org/T226744 (10jbond) this should be fixed now reopen if more is needed or ping me on irc [17:45:21] 10Operations, 10SRE-Access-Requests: please re-activate LDAP access for Dzahn - https://phabricator.wikimedia.org/T226744 (10jbond) 05Open→03Resolved a:03jbond [17:47:36] (03CR) 10Petar.petkovic: "This can wait until July 11. We don't need to cherry pick I8f18baa4b34318ac14e8c8e362ea59a1283c52c4. Until then, sysops will see the warni" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [17:49:08] * Krinkle staging on mwdebug1002 [17:51:49] (03PS2) 10Urbanecm: Restrict uploading on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [17:53:55] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [17:55:11] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/WikimediaIncubator/includes/WikimediaIncubator.php: T204883 / 93643b44a52ea7 (duration: 01m 00s) [17:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:18] T204883: Incubator emits "PHP Notice: Undefined index: realtitle" - https://phabricator.wikimedia.org/T204883 [17:55:34] (03PS1) 10Arturo Borrero Gonzalez: 
aptrepo: reprepro config file format fixes for thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519470 (https://phabricator.wikimedia.org/T215531) [17:56:08] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: reprepro config file format fixes for thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519470 (https://phabricator.wikimedia.org/T215975) [17:56:48] * Krinkle is done [17:57:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: reprepro config file format fixes for thirdparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519470 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [17:58:17] 10Operations, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Gerrit manager rights for Ottomata - https://phabricator.wikimedia.org/T226724 (10Ottomata) [17:59:11] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855, rb2009 only, fixed mathoid config [17:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:16] T220855: Split the RESTBase execution paths - https://phabricator.wikimedia.org/T220855 [18:00:04] MaxSem, RoanKattouw, and Niharika: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T1800). [18:00:04] Urbanecm: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:17] Let's do the needful then [18:00:42] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519257 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:00:43] jbond42: can you help with https://phabricator.wikimedia.org/T226724 ? 
[18:01:06] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:01:30] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855, rb2009 only, fixed mathoid config (duration: 02m 19s) [18:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:49] (03CR) 10Urbanecm: [C: 04-2] "DNM, notes for personal usage" (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519180 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:03:12] (03PS3) 10Urbanecm: Revert "Revert "Set default aliases for Project_talk namespace"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519257 (https://phabricator.wikimedia.org/T173070) [18:03:24] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855, restbase1016 [18:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:32] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855, restbase1016 (duration: 00m 08s) [18:03:36] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "gate pipeline succeeded, but failed to rebase => manual V+2 and submitting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519257 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:56] (03CR) 10jenkins-bot: Revert "Revert "Set default aliases for Project_talk namespace"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519257 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:04:54] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [18:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:00] !log robh@cumin1001 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) [18:05:01] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855, restbase1016 [18:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:04] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2039 - https://phabricator.wikimedia.org/T225988 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `db2039.codfw.wmnet` - db2039.codfw.wmnet - Removed from Puppet master and PuppetDB - Downt... [18:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:09] T220855: Split the RESTBase execution paths - https://phabricator.wikimedia.org/T220855 [18:05:21] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:519257|Revert "Revert "Set default aliases for Project_talk namespace""]] (T173070) (duration: 00m 57s) [18:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:27] T173070: Set default aliases for Project_talk namespace - https://phabricator.wikimedia.org/T173070 [18:06:42] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855, restbase1016 (duration: 01m 41s) [18:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:58] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2039 - https://phabricator.wikimedia.org/T225988 (10RobH) [18:07:28] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: fix missing -e flag in grep-dctrl for thirparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519472 (https://phabricator.wikimedia.org/T215975) [18:08:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: fix missing -e flag in grep-dctrl for thirparty/kubeadm-k8s [puppet] - 10https://gerrit.wikimedia.org/r/519472 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero 
Gonzalez) [18:08:17] !log running namespaceDupes.php across all wikis in tmux on mwmaint1002 (T173070) [18:08:21] (03PS1) 10RobH: decom db2039 [puppet] - 10https://gerrit.wikimedia.org/r/519473 (https://phabricator.wikimedia.org/T225988) [18:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:51] (03PS1) 10RobH: decom db2039 prod dns [dns] - 10https://gerrit.wikimedia.org/r/519474 (https://phabricator.wikimedia.org/T225988) [18:08:53] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [18:08:59] (03PS3) 10Urbanecm: Restrict uploading on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [18:09:02] (03CR) 10RobH: [C: 03+2] decom db2039 [puppet] - 10https://gerrit.wikimedia.org/r/519473 (https://phabricator.wikimedia.org/T225988) (owner: 10RobH) [18:09:06] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [18:09:11] (03PS2) 10RobH: decom db2039 [puppet] - 10https://gerrit.wikimedia.org/r/519473 (https://phabricator.wikimedia.org/T225988) [18:09:23] (03CR) 10RobH: [C: 03+2] decom db2039 prod dns [dns] - 10https://gerrit.wikimedia.org/r/519474 (https://phabricator.wikimedia.org/T225988) (owner: 10RobH) [18:10:01] (03Merged) 10jenkins-bot: Restrict uploading on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [18:10:16] (03CR) 10jenkins-bot: Restrict uploading on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516623 (https://phabricator.wikimedia.org/T225505) (owner: 10Lokal Profil) [18:11:13] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2039 - 
https://phabricator.wikimedia.org/T225988 (10RobH) a:05RobH→03Papaul [18:11:56] (03PS1) 10Urbanecm: Remove several wikis from commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519475 (https://phabricator.wikimedia.org/T185898) [18:12:16] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:13:43] !log kafka2001 -> kafka-main2001 migration complete. re-enabling alerting on kafka-main2001, and moving kafka2001 to role::spare::system T225005 [18:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:49] T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 [18:13:58] (03PS1) 10Herron: Revert "kafka-main2001: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/519476 [18:14:07] (03PS2) 10Herron: Revert "kafka-main2001: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/519476 [18:14:59] (03PS1) 10Urbanecm: Add + in front of wikimaniawiki in GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519477 (https://phabricator.wikimedia.org/T225505) [18:15:14] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519477 (https://phabricator.wikimedia.org/T225505) (owner: 10Urbanecm) [18:16:11] (03Merged) 10jenkins-bot: Add + in front of wikimaniawiki in GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519477 (https://phabricator.wikimedia.org/T225505) (owner: 10Urbanecm) [18:16:25] (03CR) 10jenkins-bot: Add + in front of wikimaniawiki in GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519477 (https://phabricator.wikimedia.org/T225505) (owner: 10Urbanecm) [18:18:52] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:516623|Restrict uploading on wikimaniawiki]], 
[[:gerrit:519477|Add + in front of wikimaniawiki in GroupOverrides]] (T225505) (duration: 00m 57s) [18:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:58] T225505: Change upload related permissions on wikimania-wiki - https://phabricator.wikimedia.org/T225505 [18:19:25] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 432 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:19:54] (03PS2) 10Urbanecm: Remove several wikis from commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519475 (https://phabricator.wikimedia.org/T185898) [18:20:01] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519475 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:20:26] !log urbanecm@deploy1001 Synchronized dblists/commonsuploads.dblist: [[:gerrit:516623|Restrict uploading on wikimaniawiki]] (T225505) (duration: 00m 56s) [18:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:54] (03Merged) 10jenkins-bot: Remove several wikis from commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519475 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:21:06] (03CR) 10jenkins-bot: Remove several wikis from commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519475 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:21:25] (03PS1) 10Andrew Bogott: nova-fullstack: add a watch on the number of leaked VMs. 
[puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) [18:21:40] (03CR) 10Andrew Bogott: [C: 04-1] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [18:21:50] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: add a watch on the number of leaked VMs. [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [18:22:38] !log urbanecm@deploy1001 Synchronized dblists/commonsuploads.dblist: [[:gerrit:519475|Remove several wikis from commonsuploads.dblist]] (T185898) (duration: 00m 57s) [18:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:43] T185898: Soft disable uploads doesn't work at some wikis - https://phabricator.wikimedia.org/T185898 [18:22:53] (03PS1) 10SBassett: Add rate limiter to Special:ConfirmEmail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [18:24:31] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 17 probes of 432 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:25:03] (03PS2) 10Urbanecm: Tidy up groupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519180 (https://phabricator.wikimedia.org/T185898) [18:25:31] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519180 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:26:26] (03Merged) 10jenkins-bot: Tidy up groupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519180 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:26:40] (03CR) 10jenkins-bot: Tidy up groupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519180 (https://phabricator.wikimedia.org/T185898) (owner: 10Urbanecm) [18:26:51] (03PS1) 
10Arturo Borrero Gonzalez: toolforge: k8s: add basic kubeadm infra [puppet] - 10https://gerrit.wikimedia.org/r/519480 (https://phabricator.wikimedia.org/T215975) [18:29:26] (03PS2) 10SBassett: Add rate limiter to Special:ConfirmEmail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [18:29:33] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:519180|Tidy up groupOverrides]] (T185898) (duration: 00m 56s) [18:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:39] T185898: Soft disable uploads doesn't work at some wikis - https://phabricator.wikimedia.org/T185898 [18:31:34] (03Abandoned) 10SBassett: Add rate limiter to Special:ConfirmEmail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: 10SBassett) [18:33:30] !log Morning SWAT done, namespaceDupes.php still running for T173070 [18:33:33] !log gerrit set-account --active '"Dzahn"' [18:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:35] T173070: Set default aliases for Project_talk namespace - https://phabricator.wikimedia.org/T173070 [18:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:10] (03CR) 10Jhedden: nova-fullstack: add a watch on the number of leaked VMs. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [18:38:14] (03PS1) 10Herron: kafka2001 move to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/519483 (https://phabricator.wikimedia.org/T225005) [18:38:23] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:40:20] (03CR) 10Dzahn: "thank you for fixing this! it had been overlooked indeed. nice to see that it was fixed while i was gone." 
[puppet] - 10https://gerrit.wikimedia.org/r/513713 (https://phabricator.wikimedia.org/T224752) (owner: 1020after4) [18:40:42] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/17156/" [puppet] - 10https://gerrit.wikimedia.org/r/519483 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [18:40:46] (03PS1) 10Urbanecm: Tidy up GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) [18:41:08] (03CR) 10Herron: [C: 03+2] kafka2001 move to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/519483 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [18:42:19] (03PS2) 10Urbanecm: Tidy up GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T185898) [18:43:24] (03PS1) 10Jeena Huneidi: Give scaffold template configuration options for dev purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) [18:46:15] (03PS3) 10Urbanecm: Tidy up GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) [18:46:19] !log Reopen Morning SWAT [18:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:40] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:46:59] (03CR) 10Urbanecm: [C: 03+2] Tidy up GroupOverrides (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:47:08] (03CR) 10Urbanecm: [C: 03+2] "> Patch Set 3:" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:47:42] (03Merged) 10jenkins-bot: Tidy up GroupOverrides [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:47:58] (03CR) 10jenkins-bot: Tidy up GroupOverrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519484 (https://phabricator.wikimedia.org/T173070) (owner: 10Urbanecm) [18:48:05] !log foreachwiki namespaceDupes.php --fix done (T173070) [18:48:08] 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Reedy) [18:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:10] T173070: Set default aliases for Project_talk namespace - https://phabricator.wikimedia.org/T173070 [18:49:29] 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Reedy) I note I can also replicate this in incognito mode [18:49:58] !log urbanecm@deploy1001 Synchronized dblists/commonsuploads.dblist: [[:gerrit:Tidy up GroupOverrides]], part 1 (T173070) (duration: 00m 57s) [18:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:519484|Tidy up GroupOverrides]] (T173070) (duration: 00m 56s) [18:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:49] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:52:13] !log Morning SWAT done for real [18:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:08] (03CR) 10Bstorm: [C: 03+1] "That should get us going! 
We'll probably want to scrap our current servers and build again :-p" [puppet] - 10https://gerrit.wikimedia.org/r/519480 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [18:54:09] 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Urbanecm) Reminds me of T151770. [18:54:59] (03PS1) 10Ottomata: Move page_links_change back to new schema aware refine job [puppet] - 10https://gerrit.wikimedia.org/r/519486 (https://phabricator.wikimedia.org/T226268) [18:57:10] (03PS2) 10Ottomata: Move page_links_change back to new schema aware refine job [puppet] - 10https://gerrit.wikimedia.org/r/519486 (https://phabricator.wikimedia.org/T226268) [18:57:26] (03PS3) 10Ottomata: Move page_links_change back to new schema aware refine job [puppet] - 10https://gerrit.wikimedia.org/r/519486 (https://phabricator.wikimedia.org/T226268) [18:58:01] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10RobH) [18:58:54] (03CR) 10Ottomata: [C: 03+2] Move page_links_change back to new schema aware refine job [puppet] - 10https://gerrit.wikimedia.org/r/519486 (https://phabricator.wikimedia.org/T226268) (owner: 10Ottomata) [18:59:24] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10RobH) I've put in T226756, in the future, please followup with me directly on orders (or file #hardware-requests or #procurement tasks). Assigning a random task... [19:00:05] longma: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - American version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T1900). [19:01:52] twentyafterfour: can we get a 10, 15 min delay on the train? 
we'd need to do a rb deploy [19:01:52] (03PS1) 10Jeena Huneidi: all wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519489 [19:01:55] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519489 (owner: 10Jeena Huneidi) [19:02:01] uh oh [19:02:14] what should I do now? [19:02:41] longma: You can just remove +2s [19:02:48] I just went ahead and did that for you [19:02:54] oh thanks [19:03:05] ah right you're the train conductor this week longma [19:04:14] so when mobrovac is done, I'll just add back the +2 and continue, right? [19:04:21] correct [19:04:24] thnx longma [19:04:28] appreciate it [19:04:32] thank you. I'm hitting the button. [19:04:51] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855 [19:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:58] T220855: Split the RESTBase execution paths - https://phabricator.wikimedia.org/T220855 [19:08:21] you're welcome mobrovac Pchelolo .
I'll wait for you to say the coast is clear [19:12:57] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:15:32] (03PS1) 10Hoo man: Wikidata dumps: Update minimum expected sizes [puppet] - 10https://gerrit.wikimedia.org/r/519493 (https://phabricator.wikimedia.org/T226601) [19:15:34] (03PS1) 10Hoo man: dumpwikidatajson: Fix error code detection [puppet] - 10https://gerrit.wikimedia.org/r/519494 (https://phabricator.wikimedia.org/T226601) [19:16:12] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ff6f302]: Use new projects and new config layout T220855 (duration: 11m 21s) [19:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:17] T220855: Split the RESTBase execution paths - https://phabricator.wikimedia.org/T220855 [19:16:48] and here we go longma [19:17:06] thank you again for giving us this 15 minute delay [19:17:11] it's all smooth and nice [19:18:14] thanks, I'll continue with the train now then [19:20:45] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519489 (owner: 10Jeena Huneidi) [19:21:39] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519489 (owner: 10Jeena Huneidi) [19:21:57] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.11 refs T220736 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519489 (owner: 10Jeena Huneidi) [19:22:29] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:23:43] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.11 refs T220736 [19:23:45] !log 
run namespaceDupes.php for wikis in P8674 (T173070) [19:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:48] T220736: 1.34.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T220736 [19:24:53] Urbanecm: Failed to log message to wiki. Somebody should check the error logs. [19:24:55] T173070: Set default aliases for Project_talk namespace - https://phabricator.wikimedia.org/T173070 [19:25:55] stashbot, thanks, added manually :) [19:25:55] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [19:28:19] (03PS1) 10Ladsgroup: grafana: Make the wikimedia logo white [puppet] - 10https://gerrit.wikimedia.org/r/519495 [19:29:21] PROBLEM - Check size of conntrack table on kubernetes1001 is CRITICAL: CRITICAL: nf_conntrack is 99 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:29:53] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 50.63, 24.41, 16.50 [19:30:09] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 68.90, 32.43, 20.06 [19:30:23] PROBLEM - High CPU load on API appserver on mw1281 is CRITICAL: CRITICAL - load average: 67.00, 31.74, 19.89 [19:30:25] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 57.73, 29.62, 17.58 [19:30:27] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 59.90, 30.03, 18.18 [19:30:31] PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 64.05, 31.07, 19.37 [19:30:32] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 76.20, 39.19, 22.90 [19:30:39] PROBLEM - High CPU load on API appserver on mw1313 is CRITICAL: CRITICAL - load average: 72.14, 35.93, 23.07 [19:30:39] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 64.47, 33.95, 19.67 [19:30:41] PROBLEM - High CPU load on API appserver on 
mw1233 is CRITICAL: CRITICAL - load average: 64.55, 35.01, 21.19 [19:30:43] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 64.65, 31.39, 19.58 [19:30:43] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 64.96, 34.35, 21.04 [19:30:47] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 53.39, 29.50, 17.54 [19:30:49] PROBLEM - High CPU load on API appserver on mw1224 is CRITICAL: CRITICAL - load average: 54.50, 26.57, 15.39 [19:30:51] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 54.15, 29.85, 17.53 [19:31:21] PROBLEM - High CPU load on API appserver on mw1315 is CRITICAL: CRITICAL - load average: 72.71, 41.95, 26.14 [19:31:21] PROBLEM - High CPU load on API appserver on mw1344 is CRITICAL: CRITICAL - load average: 78.22, 41.22, 25.36 [19:31:30] Ouch. Train issue? [19:31:35] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 85.32, 47.43, 27.82 [19:31:40] not sure [19:31:49] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 92 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:31:49] RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 33.28, 30.67, 20.68 [19:31:57] RECOVERY - High CPU load on API appserver on mw1284 is OK: OK - load average: 30.62, 29.75, 20.07 [19:32:05] RECOVERY - High CPU load on API appserver on mw1313 is OK: OK - load average: 38.46, 34.42, 23.70 [19:32:09] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 32.95, 29.90, 20.13 [19:32:09] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 28.94, 30.08, 20.74 [19:32:19] PROBLEM - Nginx local proxy to apache on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:32:28] this happened on 
tuesday also [19:32:38] Oh, dying back down. [19:32:42] so it wasn't related to the "run NamespaceDupes.php" part then? [19:33:07] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:33:29] PROBLEM - High CPU load on API appserver on mw1289 is CRITICAL: CRITICAL - load average: 66.59, 37.98, 23.70 [19:33:33] PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:33:41] RECOVERY - Nginx local proxy to apache on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.796 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:33:45] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 59.91, 39.13, 22.93 [19:33:50] mutante: Running namespaceDupes was a SWAT thing, right? [19:34:05] And that's a maintenance script with code in it to back off if it's going to overload the site anyway. [19:34:21] It wouldn't update the API servers either [19:34:24] DB servers, maybe [19:34:28] Is there something I should look at to determine:? [19:34:29] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 74913 bytes in 2.276 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:34:32] s/update/upset/ [19:34:49] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:35:11] RECOVERY - High CPU load on API appserver on mw1224 is OK: OK - load average: 14.39, 22.75, 16.93 [19:35:55] James_F: yes, and ok. 
then "just" the issue that has happened before during deployment [19:35:59] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 19.10, 28.84, 23.22 [19:36:09] longma: it seems to be over already [19:36:23] RECOVERY - High CPU load on API appserver on mw1289 is OK: OK - load average: 22.39, 31.36, 23.67 [19:36:24] yeah [19:36:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:36:35] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 15.75, 23.94, 19.35 [19:37:09] RECOVERY - High CPU load on API appserver on mw1344 is OK: OK - load average: 21.27, 34.70, 28.77 [19:37:22] RECOVERY - High CPU load on API appserver on mw1347 is OK: OK - load average: 20.04, 34.09, 29.06 [19:37:22] thcipriani just mentioned it's normal for this to happen [19:37:41] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 17.94, 24.29, 20.10 [19:37:41] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 19.97, 25.60, 21.22 [19:37:47] unfortunately, it seems that when deploying train this happens fairly regularly :\ [19:38:10] that is, hhvm will use up a ton of resources for a short time on each appserver [19:38:35] RECOVERY - High CPU load on API appserver on mw1315 is OK: OK - load average: 23.23, 37.96, 32.71 [19:39:05] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:39:17] hhvm sucks [19:40:59] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 18.98, 24.67, 22.07 [19:41:55] PROBLEM - Apache HTTP on mw1275 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [19:43:21] RECOVERY - Apache HTTP on mw1275 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:44:17] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:45:11] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:46:15] RECOVERY - Check size of conntrack table on kubernetes1002 is OK: OK: nf_conntrack is 76 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:47:30] longma: is the train done? When is it going to be done? [19:47:59] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 16.06, 20.82, 23.76 [19:48:18] Amir1: Should be done [19:48:35] Mostly just watching HHVM noise improve again [19:48:35] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 12.48, 17.72, 23.99 [19:49:11] Reedy: Okay, I want to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/519492 to reduce size of jobqueue by 30% [19:49:32] (I can deploy it in evening SWAT but it's 1am here) [19:49:57] Yeah, just need to wait for longma to respond you're ok to deploy me thinks [19:50:16] sorry Amir1 , yeah train is done [19:50:59] PROBLEM - Check size of conntrack table on kubernetes1001 is CRITICAL: CRITICAL: nf_conntrack is 92 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:51:15] awesome! 
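[editor's note] The recurring `Check size of conntrack table` alerts above (nf_conntrack at 90-99 % full on kubernetes1001/1002) compare the kernel's live connection-tracking count against its configured maximum. A minimal sketch of such a check, assuming the standard Linux sysctl paths `/proc/sys/net/netfilter/nf_conntrack_count` and `nf_conntrack_max`; the warn/crit thresholds here are illustrative, not WMF's exact configuration:

```python
# Minimal sketch of a conntrack-fullness check in the spirit of the
# "Check size of conntrack table" alerts above. The /proc paths are the
# standard Linux netfilter counters; the real check's options may differ.

CONNTRACK_COUNT = "/proc/sys/net/netfilter/nf_conntrack_count"
CONNTRACK_MAX = "/proc/sys/net/netfilter/nf_conntrack_max"


def read_proc_int(path: str) -> int:
    """Read a single integer value from a /proc sysctl file."""
    with open(path) as f:
        return int(f.read().strip())


def conntrack_status(count: int, maximum: int,
                     warn: int = 80, crit: int = 90):
    """Return (nagios-style state, percent full) for the conntrack table."""
    pct = count * 100 // maximum
    if pct >= crit:
        return "CRITICAL", pct
    if pct >= warn:
        return "WARNING", pct
    return "OK", pct
```

On a host you would feed `conntrack_status(read_proc_int(CONNTRACK_COUNT), read_proc_int(CONNTRACK_MAX))` into the monitoring output; when the table actually fills, the kernel starts dropping new connections, which is why this alerts well before 100 %.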
[19:53:01] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:55:19] PROBLEM - Check size of conntrack table on kubernetes1001 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:56:37] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 17.17, 17.76, 23.61 [19:58:28] (03CR) 10ArielGlenn: [C: 03+1] "Looks good to me if someone wants to double check the numbers" [puppet] - 10https://gerrit.wikimedia.org/r/519493 (https://phabricator.wikimedia.org/T226601) (owner: 10Hoo man) [19:59:11] PROBLEM - SSH on bast3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:59:18] (03PS2) 10Andrew Bogott: nova-fullstack: add a watch on the number of leaked VMs. [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) [19:59:43] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: add a watch on the number of leaked VMs. [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [20:02:47] (03PS2) 10Bstorm: toolforge: k8s: add basic kubeadm infra [puppet] - 10https://gerrit.wikimedia.org/r/519480 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [20:03:59] PROBLEM - Check size of conntrack table on kubernetes1001 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:06:19] RECOVERY - SSH on bast3002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:07:35] (03PS3) 10Andrew Bogott: nova-fullstack: add a watch on the number of leaked VMs. 
[puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) [20:07:45] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:07:53] PROBLEM - Prometheus bast3002/ops restarted: beware possible monitoring artifacts on bast3002 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [20:08:00] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: add a watch on the number of leaked VMs. [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [20:09:01] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:10:24] (03CR) 10Bstorm: [C: 03+2] toolforge: k8s: add basic kubeadm infra [puppet] - 10https://gerrit.wikimedia.org/r/519480 (https://phabricator.wikimedia.org/T215975) (owner: 10Arturo Borrero Gonzalez) [20:10:57] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 20.08, 20.47, 23.75 [20:13:48] (03PS1) 10DannyS712: Clean up wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) [20:14:21] PROBLEM - Prometheus prometheus1003/global restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [20:14:27] (03PS4) 10Andrew Bogott: nova-fullstack: add a watch on the number of leaked VMs. 
[puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) [20:14:37] PROBLEM - Prometheus prometheus1004/global restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [20:14:41] mutante: Clearly we should move to php72 faster! ;-( [20:14:53] PROBLEM - Prometheus prometheus2003/global restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [20:14:58] Amir1: Have you deployed? [20:15:11] PROBLEM - Prometheus prometheus2004/global restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [20:15:15] James_F: looks like prometheus is restarting on everything [20:15:24] eh, i mean on prometheus* and bast* [20:15:32] and yes @ 7.2 [20:15:35] Fun. [20:15:47] On it [20:16:08] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: add a watch on the number of leaked VMs. [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [20:16:16] (03PS5) 10Andrew Bogott: nova-fullstack: add a watch on the number of leaked VMs. [puppet] - 10https://gerrit.wikimedia.org/r/519478 (https://phabricator.wikimedia.org/T226647) [20:18:49] !log ladsgroup@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/Wikibase: [[gerrit:519492|Avoid inserting a new addUsage job when the current usage stays untouched (duration: 01m 14s) [20:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:57] James_F: I'm done, I'm continuing to monitor things [20:20:02] Cool. 
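[editor's note] The sync above deploys gerrit:519492, which cuts job-queue volume by skipping `addUsage` jobs when the tracked usage has not changed. A hypothetical sketch of that "don't enqueue no-op jobs" pattern; the class, queue, and aspect names are invented for illustration, not Wikibase's actual code:

```python
# Hypothetical sketch of the idea behind gerrit:519492: avoid inserting a new
# addUsage job when the current usage stays untouched. Names are illustrative.

class UsageTracker:
    def __init__(self, queue):
        self.queue = queue        # plain list standing in for the job queue
        self.current = {}         # entity id -> set of used aspects

    def add_usages(self, entity_id, aspects):
        """Enqueue an update job only if the usage set actually changed."""
        new = set(aspects)
        if self.current.get(entity_id) == new:
            return False          # unchanged: no job enqueued
        self.current[entity_id] = new
        self.queue.append(("addUsage", entity_id, sorted(new)))
        return True
```

Since most page re-parses touch the same entity aspects as before, comparing against the stored set before enqueueing is enough to drop a large share of jobs, consistent with the ~30% queue reduction mentioned below.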
[20:21:42] (03PS2) 10DannyS712: Clean up wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) [20:22:13] (03Restored) 10SBassett: Add rate limiter to Special:ConfirmEmail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: 10SBassett) [20:22:19] (03PS3) 10SBassett: Add rate limiter to Special:ConfirmEmail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [20:24:07] (03PS1) 10Andrew Bogott: nova-fullstack monitoring: fix a misnamed file [puppet] - 10https://gerrit.wikimedia.org/r/519513 (https://phabricator.wikimedia.org/T226647) [20:25:35] (03PS4) 10SBassett: Add rate limiter to Special:ConfirmEmail - config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [20:25:42] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack monitoring: fix a misnamed file [puppet] - 10https://gerrit.wikimedia.org/r/519513 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [20:25:44] (03PS5) 10SBassett: Add rate limiter to Special:ConfirmEmail - core change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [20:26:37] PROBLEM - puppet last run on cloudcontrol1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/check_nova_fullstack_leaks.py] [20:26:59] (03CR) 10jerkins-bot: [V: 04-1] Add rate limiter to Special:ConfirmEmail - core change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: 10SBassett) [20:27:01] (03PS6) 10SBassett: Add rate limiter to Special:ConfirmEmail - config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [20:31:06] _joe_: https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?panelId=1&fullscreen&orgId=1&from=now-1h&to=now [20:31:07] (03PS3) 10DannyS712: Clean up wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) [20:31:19] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/MobileFrontend/resources/dist: T221191: Log editor switches to visualeditorfeatureuse (duration: 00m 50s) [20:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:24] T221191: VE mobile default: start tracking editor switches on mobile + desktop - https://phabricator.wikimedia.org/T221191 [20:31:58] <_joe_> Amir1: wow that's amazing [20:32:03] RECOVERY - puppet last run on cloudcontrol1003 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:32:45] YESS \o/ [20:33:10] (03PS4) 10DannyS712: Clean up wgNamespaceAliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) [20:34:48] (03CR) 10DannyS712: "1 question, otherwise should be ready to go" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) (owner: 10DannyS712) [20:36:01] RECOVERY - Prometheus prometheus1003/global restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [20:36:21] RECOVERY - Prometheus prometheus1004/global restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [20:36:37] RECOVERY - Prometheus prometheus2003/global restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [20:36:51] RECOVERY - Prometheus bast3002/ops restarted: beware possible monitoring artifacts on bast3002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [20:36:57] RECOVERY - Prometheus prometheus2004/global restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [20:39:06] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (10MarcoAurelio) @Qgil Thanks for your comment. As a user of some mailing lists, I am still interested in upgrading to Mailman 3+. At T52864#5022944 w...
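[editor's note] The `Add rate limiter to Special:ConfirmEmail` patches above are a mediawiki-config change (presumably a `$wgRateLimits` entry), not new code. As a hedged illustration of what such a limit enforces, here is a minimal fixed-window rate limiter; the class, keys, and thresholds are invented for this sketch:

```python
# Illustrative fixed-window rate limiter, in the spirit of the
# Special:ConfirmEmail rate-limit change above (the actual patch is a
# $wgRateLimits config entry in mediawiki-config, not this code).

class FixedWindowLimiter:
    def __init__(self, max_hits, window_seconds):
        self.max_hits = max_hits
        self.window = window_seconds
        self.counters = {}        # key -> (window start time, hit count)

    def allow(self, key, now):
        """Return True if `key` may perform the action at time `now`."""
        start, hits = self.counters.get(key, (now, 0))
        if now - start >= self.window:    # window expired: start a new one
            start, hits = now, 0
        if hits >= self.max_hits:
            self.counters[key] = (start, hits)
            return False                  # over the limit for this window
        self.counters[key] = (start, hits + 1)
        return True
```

MediaWiki keys such limits per user or per IP; the point of adding one to a confirmation-email sender is to stop a single client from flooding the mail queue.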
[20:40:07] RECOVERY - Check size of conntrack table on kubernetes1001 is OK: OK: nf_conntrack is 78 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:42:20] 10Operations, 10observability: consider running bastion Prometheis inside cgroups - https://phabricator.wikimedia.org/T226769 (10CDanis) [20:43:27] mutante: James_F: re: the prometheus alerts for the 'global' prometheus -- that alert actually comes from Prometheus's internal uptime metric, which is exported by each Prometheus instance to the global ones -- so when one goes, all the globals will alert as well [20:43:46] the documentation for the alert states such, but it's still confusing, and also I added this alert before we supported a notes_url on a check_prometheus alert [20:43:48] which I should now fix :) [20:45:30] Ha. [20:47:03] argh, wait, that's still pending [20:47:30] (03CR) 10CDanis: "What's blocking this? I'd love to start adding notes urls to check_prometheus rules" [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [20:47:46] cdanis: aha! thanks much, also for the notes_url [20:48:10] okay once I can, I'll add a link to https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted for that alert [20:48:43] perhaps I can clarify it for global proms [20:50:30] cool! 
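[editor's note] cdanis's point above is worth modeling: each Prometheus exports its own uptime metric, the global instances scrape (federate) everyone's metrics, so one restarted site-local instance trips the "restarted" alert on every global at once. A toy model of that fan-out; instance names match the log, but the threshold and uptime numbers are invented:

```python
# Toy model of the alert fan-out cdanis describes: each Prometheus exports its
# own uptime metric, globals federate everyone's metrics, so a single restart
# fires the alert on every global instance. Threshold and uptimes are invented.

RESTART_THRESHOLD = 600  # flag any instance whose uptime is under 10 minutes


def restarted_instances(uptimes_by_instance, threshold=RESTART_THRESHOLD):
    """Instances whose exported uptime (seconds) indicates a recent restart."""
    return sorted(i for i, up in uptimes_by_instance.items() if up < threshold)


def global_alerts(federated_views, threshold=RESTART_THRESHOLD):
    """For each global Prometheus, which federated instances look restarted."""
    return {g: restarted_instances(view, threshold)
            for g, view in federated_views.items()}


# One site-local Prometheus (bast3002/ops) restarts; both globals see it,
# which is exactly the burst of prometheus*/global alerts in the log above.
views = {
    "prometheus1003/global": {"bast3002/ops": 120, "prometheus1004": 90000},
    "prometheus2003/global": {"bast3002/ops": 120, "prometheus1004": 90000},
}
```

This is why the fix discussed below is a clearer alert description plus a notes_url, rather than suppressing the duplicates: the globals really are all observing the same restarted instance.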
[20:50:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:50:57] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:50:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:51:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:51:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:51:42] cdanis: only thing blocking https://gerrit.wikimedia.org/r/509365 is review. 
i have tried to add urls that look sane or we could default to a landing page which says please create a page for this [20:51:43] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:51:56] but i guess that can wait :( [20:52:00] no it's okay [20:52:05] not urgent, just curious [20:52:06] (03PS1) 10CDanis: prometheus restarts: clarify alert for global proms [puppet] - 10https://gerrit.wikimedia.org/r/519517 [20:52:16] jbond42|away: I'll take a pass tonight or tomorrow [20:52:28] ack cheers [20:52:33] (03CR) 10jerkins-bot: [V: 04-1] prometheus restarts: clarify alert for global proms [puppet] - 10https://gerrit.wikimedia.org/r/519517 (owner: 10CDanis) [20:52:35] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [20:52:42] are these ^^ errors something to worry about as I'm online? [20:52:53] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.5402 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:53:40] going to guess one of our usual 503 spikes [20:53:50] yeah, looks like it, and already resolved [20:54:03] ok thanks chris [20:54:23] cp1075 had a stomachache and caused about 90k 503s [20:55:27] ack [20:56:13] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [20:56:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:56:57] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [20:56:59] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:57:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:57:15] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:57:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:57:43] (03PS2) 10CDanis: prometheus restarts: clarify alert for global proms [puppet] - 10https://gerrit.wikimedia.org/r/519517 [20:58:53] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:00:11] (03PS3) 10CDanis: prometheus restarts: clarify alert for global proms [puppet] - 10https://gerrit.wikimedia.org/r/519517 [21:01:43] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [21:01:49] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [21:02:21] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [21:02:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:02:47] (03CR) 10CDanis: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/17162/" [puppet] - 10https://gerrit.wikimedia.org/r/519517 (owner: 10CDanis) [21:02:57] RECOVERY - Eqiad 
HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [21:03:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:07:43] i have 500's again on commons page [21:07:58] * Connection state changed (MAX_CONCURRENT_STREAMS updated)! [21:07:58] < HTTP/2 500 [21:07:58] < date: Thu, 27 Jun 2019 21:07:30 GMT [21:07:58] < server: Varnish [21:07:58] < [21:08:00] * Connection #0 to host commons.m.wikimedia.org left intact [21:08:13] blank white page. like the http/2 connection just vanished [21:08:33] from .nl [21:09:17] thedj: we just had a short spike .. "cp1075 had a stomachache and caused about 90k 503s" [21:09:31] 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Development services), and 3 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10mmodell) Thanks @MoritzMuehlenhoff ! I really appreciate it! I'll insta... [21:09:40] https://phabricator.wikimedia.org/P8679 [21:10:56] mutante: :D [21:11:15] ah, that's different thedj [21:11:54] seems pretty consistent... 
[21:12:45] thedj: could be https://phabricator.wikimedia.org/T209590 [21:12:58] i notice " Using HTTP2" [21:13:34] the "MAX_CONCURRENT_STREAMS" string appears in both [21:14:08] hauskatze: hello [21:14:36] hmmm [21:14:40] thedj: that's really interesting [21:15:05] have it on http 1.1 as well [21:15:15] if i force the client to 1.1 [21:15:40] and it's on the mobile page, but not the standard commons page [21:19:19] what's weird is that it seems to die on the varnish layer, because otherwise i'd see the varnish error page right ? [21:19:33] yeah, you're also missing a bunch of response headers I'd expect even on a varnish error [21:19:53] like X-Cache should still be there [21:20:23] I'll collect some logs and file a ticket [21:20:33] so.. tls termination or something ? [21:20:42] cdanis: thx [21:21:00] I don't think so, although I don't know for sure; I suspect either weird internal varnish error, or a bug in our VCL [21:22:16] and only when logged in.. [21:22:37] yeah, I just noticed that [21:22:47] if I omit your Cookie header from the curl command line, everything is fine [21:23:20] (03PS1) 10Ppchelko: Remove references to pdfrender from RESTBase. 
[puppet] - 10https://gerrit.wikimedia.org/r/519526 (https://phabricator.wikimedia.org/T226675) [21:24:53] (03CR) 10Ppchelko: "Step 1 from https://gerrit.wikimedia.org/r/c/operations/puppet/+/514226#message-10d3c9fb179f10e94b7b35bc1a308b438be972af" [puppet] - 10https://gerrit.wikimedia.org/r/519526 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [21:25:27] (03PS1) 10Bstorm: toolforge: start the configuration yaml for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) [21:25:50] (03CR) 10jerkins-bot: [V: 04-1] toolforge: start the configuration yaml for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [21:35:20] (03PS2) 10Bstorm: toolforge: start the configuration yaml for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) [21:35:44] (03CR) 10jerkins-bot: [V: 04-1] toolforge: start the configuration yaml for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [21:35:56] brion: got 1.5h before the next deploy window, so no worries :) [21:36:28] Krinkle: shall i wait for that window then? :) [21:36:36] * brion is in new york so time zones are confusing [21:37:47] 10Operations, 10SRE-Access-Requests: please re-activate LDAP access for Dzahn - https://phabricator.wikimedia.org/T226744 (10Dzahn) Thanks, confirmed working Wikitech (with new password after reset) and then Gerrit and Phabricator after i got separately unblocked there as well. [21:39:06] 10Operations, 10Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (10CDanis) [21:39:43] brion: If you'd like one of the SWAT-deployers to roll it out instead, you can add it to the wiki page for that window instead. [21:39:53] Or as deployer, could roll it out now. 
[21:40:34] I meant that we have 1.5 before someone else wants to deploy something :) [21:41:18] cool, ok :D [21:41:24] yeah as soon as it merges i'm good [21:42:11] aaaand there it is [21:42:19] (03PS3) 10Bstorm: toolforge: start the configuration yaml for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) [21:43:21] ok it at least doesn't break production web views, but that's not surprising ;) [21:43:48] !mwlog deploying fix for TMH jobqueue bug T226748 [21:43:49] T226748: WebVideoTranscodeJob fatal: Call to getStdout() on a non-object - https://phabricator.wikimedia.org/T226748 [21:44:02] hmm did i misremember that loggy thing [21:44:08] !log deploying fix for TMH jobqueue bug T226748 [21:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:14] yay [21:44:19] brion: check with Reedy btw.. https://phabricator.wikimedia.org/T226713 [21:44:19] :D [21:44:58] brion: i see he was also looking at running requeueTranscodes [21:44:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:45:34] thedj: that probably explains why we saw multiple fatals in production, something failed out during that and barfed :D [21:45:44] brion: likely [21:45:59] ok i won't re-run it then, until we have time to figure out what the files were failing on [21:46:21] it's still running [21:46:28] it's only on Fo files [21:47:08] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:47:41] Reedy: yeah later i should add better filter 
options so we can re-transcode a single file type rather than a whole media type [21:47:49] brion: I filed a bug for that :P [21:47:51] :D [21:47:59] https://phabricator.wikimedia.org/T226718 [21:48:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:48:45] great, i won't do that immediately as i'm technically on vacation today and tomorrow ;) but can fix it up next week. remind me if i don't poke it [21:48:59] should be pretty easy [21:49:31] ok it claims to be done [21:49:31] (03PS1) 10Andrew Bogott: nova-fullstack: fix name of nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/519531 (https://phabricator.wikimedia.org/T226647) [21:50:09] yeah, exactly. Just needs a param for the other two columns in the index [21:51:04] Krinkle: thx for the bug report :D i'll check the logs after more files have run to see if there's any commonality to the transcode errors that were triggering the fatals [21:51:16] Reedy: and thanks for running the batch script :D [21:51:34] which reminds me to open a ticket on that piece of software that it should use exitcodes... [21:51:40] thedj: and thanks for being awesome! [21:51:50] brion: np. go vacation you. [21:51:54] :D [21:52:04] :) [21:52:05] achievement unlocked: deployed from a hotel bar [21:52:13] gods :P [21:52:16] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: fix name of nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/519531 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [21:52:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:53:10] cdanis: and it works .. probably this then ^^^ [21:53:36] oh, so it was an app layer problem? interesting. 
still not sure what's up with the missing headers [21:53:45] or.. it was my old cookie... [21:54:01] that's the only other thing that changed since the last refresh [21:54:38] (03PS2) 1020after4: phabricator: Allow admins to silence maniphest bulk jobs via sudo [puppet] - 10https://gerrit.wikimedia.org/r/517140 [21:54:40] (03PS1) 1020after4: phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) [21:55:35] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [21:55:35] ok signing off for now, but do ping me if anything explodes unexpectedly, i won't go far. ;) [21:55:45] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [21:56:21] 10Operations, 10Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (10CDanis) So, it looks like that this 500 did in fact come from the application layer... but shouldn't we still be getting more response headers from the edge? 
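The triage heuristic cdanis is applying above — a 500 that really traversed the cache layer should still carry edge headers like X-Cache, while thedj's paste had only date: and server: — can be sketched as a tiny helper. Header names come from the curl paste and the discussion; the classification rule itself is a hypothetical sketch, not anything running in production monitoring:

```shell
# Sketch of the header-based triage discussed above: a response that
# traversed the edge cache normally should still carry X-Cache, even
# on an error. Hypothetical helper; rule approximated from the chat.
classify_500() {
    # $1: file containing "name: value" response headers, one per line
    if grep -qi '^x-cache:' "$1"; then
        echo "has edge cache headers: error likely app-layer"
    elif grep -qi '^server: *varnish' "$1"; then
        echo "bare Varnish response: generated or truncated at the edge"
    else
        echo "origin unclear"
    fi
}

# The paste from thedj had only date: and server: headers:
hdrs=$(mktemp)
printf 'date: Thu, 27 Jun 2019 21:07:30 GMT\nserver: Varnish\n' > "$hdrs"
classify_500 "$hdrs"
rm -f "$hdrs"
```

As the thread shows, the heuristic can mislead: this 500 turned out to come from the application layer even though the usual edge headers were missing.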
[21:56:23] (03PS2) 1020after4: phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) [21:56:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:56:40] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:57:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:57:17] sigh [21:57:17] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [21:57:22] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:57:22] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:57:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:57:36] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:57:44] 
PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [21:58:15] !log cdanis@cp1075.eqiad.wmnet ~ % sudo -i varnish-backend-restart [21:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:05] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:59:32] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [21:59:45] 10Operations, 10observability: consider running bastion Prometheis inside cgroups - https://phabricator.wikimedia.org/T226769 (10CDanis) 05Open→03Invalid I'm told the plan is to move these onto Ganeti in PoPs, so that seems just as good. 
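The recurring graphite1004 pages follow one rule: an alert goes CRITICAL when at least some percentage of recent 5xx req/min datapoints exceed a threshold. A rough shell sketch of that logic (the real check is an Icinga graphite plugin; thresholds are copied from the alert text above, while the function name, window handling, and OK message format are approximations):

```shell
# Rough sketch of the graphite alert logic seen above: CRITICAL when
# >= crit_pct percent of datapoints exceed the critical threshold.
# Thresholds copied from the alert text; everything else approximated.
check_5xx() {
    crit=$1 crit_pct=$2; shift 2
    printf '%s\n' "$@" | awk -v crit="$crit" -v pct="$crit_pct" '
        $1 > crit { above++ }
        { n++ }
        END {
            p = 100 * above / n
            if (p >= pct)
                printf "CRITICAL: %.2f%% of data above the critical threshold [%s]\n", p, crit
            else
                printf "OK: %.2f%% of data above the threshold [%s]\n", p, crit
        }'
}

# Ten samples, two above 1000 req/min -> 20.00%, matching the
# "20.00% of data above the critical threshold [1000.0]" pages.
check_5xx 1000.0 10 120 90 1500 2200 80 60 100 95 70 85
```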
[22:00:06] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [22:00:29] 10Operations, 10observability: consider running bastion Prometheus inside cgroups - https://phabricator.wikimedia.org/T226769 (10faidon) [22:01:05] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.3982 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [22:01:12] (03PS2) 10Gilles: Serve JPG when WEBP conversion fails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707) [22:01:15] (03CR) 10Gilles: Serve JPG when WEBP conversion fails (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707) (owner: 10Gilles) [22:02:50] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:03:15] 10Operations, 10observability: consider running bastion Prometheis inside cgroups - https://phabricator.wikimedia.org/T226769 (10faidon) [22:03:36] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:03:38] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [22:03:50] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:04:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:04:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:05:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:05:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:05:28] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:06:22] 10Operations, 10ops-eqiad, 10DC-Ops: install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) p:05Triage→03Normal [22:06:43] 10Operations, 10Release Pipeline, 10serviceops, 10Core Platform Team (RESTBase Split (CDP2)), and 5 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10mobrovac) Regarding the deployment plan, the main pain point is that we will need to ha... 
[22:08:22] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [22:09:10] PROBLEM - puppet last run on phab1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 28 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php7.2-mysqlnd] [22:09:20] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [22:10:02] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [22:10:44] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [22:11:08] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [22:11:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:11:47] 10Operations, 10media-storage, 10serviceops, 10Patch-For-Review, 10User-jijiki: Swift object servers become 
briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10Gilles) At a glance on a given proxy the same object doesn't occur multiple times in a row. But the same desti... [22:12:51] (03CR) 10Urbanecm: [C: 03+1] "LGTM!" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519509 (https://phabricator.wikimedia.org/T226765) (owner: 10DannyS712) [22:13:13] (03PS3) 1020after4: phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) [22:14:21] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10RobH) [22:16:12] 10Operations, 10MediaWiki-Maintenance-scripts, 10Performance-Team: cron spam for slow queries on mwmaint /usr/local/bin/foreachwiki initSiteStats.php --update > /dev/null - https://phabricator.wikimedia.org/T216243 (10Krinkle) [22:22:28] (03CR) 10Urbanecm: [C: 04-2] "> This can wait until July 11." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [22:25:34] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 132.3 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [22:25:44] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10RobH) [22:26:40] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh - https://phabricator.wikimedia.org/T226782 (10RobH) [22:27:11] 10Operations, 10ops-eqiad, 10DC-Ops: install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [22:27:34] (03CR) 10Dzahn: [C: 03+1] phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [22:28:41] 10Operations, 10ops-eqiad, 10DC-Ops: 
install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [23:00:05] MaxSem, RoanKattouw, and Niharika: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190627T2300). [23:00:05] Amir1: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:13] I'll SWAT [23:00:21] (and I also have a patch the bot didn't notice [23:01:38] Amir1: Are you here for your SWAT? It's pretty late for you [23:12:36] (03CR) 1020after4: [C: 03+1] phabricator: Manage ownership of /var/log/phd/ssh.log [puppet] - 10https://gerrit.wikimedia.org/r/519532 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [23:14:17] REPORT OUTAGE: mediawiki changestream is currently down. (https://stream.wikimedia.org/v2/stream/recentchange) https://www.irccloud.com/pastebin/VdGZnjZz/ [23:15:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:20:12] 14:44:09 !log deploying fix for TMH jobqueue bug T226748 [23:20:13] T226748: WebVideoTranscodeJob fatal: Call to getStdout() on a non-object - https://phabricator.wikimedia.org/T226748 [23:20:41] brion: Did you deploy this to a different server or something?
It isn't on deploy1001, I just pulled something else on there and this came with it [23:20:56] RoanKattouw: hmm, lemme double-check [23:21:46] Current state of deploy1001: I've run git pull (which pulled in the wmf.11 commit updating the submodule) but not git submodule update extensions/TMH (which would check out that commit in TMH) [23:22:06] RoanKattouw: it was deploy1001 yeah [23:22:17] Then what/how did you deploy it? [23:22:18] ah i best i forgot a step [23:22:19] *bet [23:22:37] I don't see any syncs in the log, and it came riding in with my git pull so you couldn't have pulled it on there either [23:22:45] Perhaps you ran git pull in the config repo or the wmf.10 checkout? [23:22:58] yep, i completely failed to pull in the patch :D [23:23:05] * brion slaps self [23:23:07] In any case, if you're around now, I can deploy it for you now and you can see if it worked :) [23:23:11] great :D [23:23:42] It's on mwdebug1002, in case it's testable there (job queue stuff isn't always) [23:24:33] RoanKattouw: yeh no way to test it from mwdebug1002 [23:24:44] that's what i hate about deploying these job queue changes, they're very hard to test currently [23:25:04] just gotta wait until it triggers another fatal in the background jobs (or fails to do so) [23:25:48] !log roan is fixing deploy of T226748 which failed to include the patch (whoops) [23:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:54] T226748: WebVideoTranscodeJob fatal: Call to getStdout() on a non-object - https://phabricator.wikimedia.org/T226748 [23:26:40] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/GrowthExperiments/includes/HomepageHooks.php: Fix JS error on Special:Homepage (duration: 00m 50s) [23:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:55] 10Operations, 10Mail, 10Phabricator, 10Patch-For-Review, and 2 others: Phabricator email comments not posted - 
https://phabricator.wikimedia.org/T224752 (10Dzahn) nice fix! sorry, i was off for quite some time and reading this now. indeed a good catch that we did not catch during the migration. thanks all! [23:28:56] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/TimedMediaHandler/: T226748 (duration: 00m 50s) [23:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:25] next deploy i'll be sure to follow all the steps more carefully :D [23:30:27] thanks roan! [23:31:41] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (10Qgil) To be clear, the proposal is that users of a Mailing mailing list about X could keep using the same email features subscribing in mailing list... [23:33:04] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10revi) Q: Does discourse support 'mailing list mode' with NO archives left after it is distributed? At least, that's how it works on [[https://lists.wikime... [23:43:04] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:44:22] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
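The failed deploy Roan diagnosed above — `git pull` ran in the superproject, but `git submodule update` did not, so the TMH fix never landed in the working tree — is easy to reproduce in throwaway repos. Repo and path names below are made up for the demo and do not reflect the real deploy1001 layout:

```shell
# Demo of the failure mode above: `git pull` in the superproject moves
# the *recorded* submodule commit, but the submodule working tree stays
# on the old commit until `git submodule update` is run.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@localhost
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@localhost
# Allow file:// submodule clones on newer git versions:
export GIT_CONFIG_COUNT=1 GIT_CONFIG_KEY_0=protocol.file.allow GIT_CONFIG_VALUE_0=always
tmp=$(mktemp -d)

# An "extension" repo with one commit, embedded in a "core" repo.
git init -q "$tmp/tmh" && git -C "$tmp/tmh" commit -q --allow-empty -m 'fix v1'
git init -q "$tmp/core" && git -C "$tmp/core" commit -q --allow-empty -m 'init core'
git -C "$tmp/core" submodule --quiet add "$tmp/tmh" extensions/TMH
git -C "$tmp/core" commit -q -m 'add TMH submodule'

# A deploy-style checkout of core, submodules included.
git clone -q --recurse-submodules "$tmp/core" "$tmp/deploy"

# A new fix lands in the extension and is recorded in core.
git -C "$tmp/tmh" commit -q --allow-empty -m 'fix v2'
git -C "$tmp/core/extensions/TMH" pull -q
git -C "$tmp/core" commit -qam 'bump TMH'

# Step 1 only -- what happened in the log: the submodule stays on v1.
git -C "$tmp/deploy" pull -q
before=$(git -C "$tmp/deploy/extensions/TMH" log -1 --format=%s)

# The forgotten step: now the submodule checkout moves to v2.
git -C "$tmp/deploy" submodule --quiet update
after=$(git -C "$tmp/deploy/extensions/TMH" log -1 --format=%s)

echo "after pull only:        $before"
echo "after submodule update: $after"
```

This is why the fix only reached production once the submodule was actually updated and synced (the `Synchronized php-1.34.0-wmf.11/extensions/TimedMediaHandler/` line above).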