[00:00:04] <jouncebot>	 Deploy window NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190628T0000)
[00:27:50] <Krinkle>	 Happy Friday
[00:36:28] <wikibugs>	 (PS1) Urbanecm: Setup EditorJourney for Arabic Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/519581 (https://phabricator.wikimedia.org/T225737)
[00:36:47] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[01:22:04] <Krinkle>	 !log Killing arclamp-log on webperf1002, no flame graphs for three days, presumably mwlog/redis connection dropped again. T215740
[01:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:10] <stashbot>	 T215740: Create Icinga check for ArcLamp (xenon-log) service health - https://phabricator.wikimedia.org/T215740
[02:24:40] <wikibugs>	 Operations, serviceops, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (Jdforrester-WMF)
[02:24:42] <wikibugs>	 Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Jdforrester-WMF)
[02:26:27] <icinga-wm>	 PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 134.8 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[02:26:53] <wikibugs>	 Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Jdforrester-WMF) This was a blocker for {T219150}, right? Does {T224857} also block that or is this work s...
[02:30:49] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22484160 and 2 seconds
[02:35:11] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 989840 and 0 seconds
[03:17:53] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:17:59] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[03:18:01] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{
[03:18:01] <icinga-wm>	 regated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[03:18:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:18:13] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:18:19] <icinga-wm>	 PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid
[03:18:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:18:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:18:39] <icinga-wm>	 PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[03:18:55] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[03:18:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:18:55] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[03:19:15] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:19:29] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/
[03:19:37] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:19:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:19:43] <icinga-wm>	 RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid
[03:19:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:20:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:20:03] <icinga-wm>	 RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[03:20:17] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[03:20:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:20:21] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[03:20:49] <icinga-wm>	 RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[03:20:49] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[03:20:49] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[03:21:21] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 52.22 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:24:17] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 80.61 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:58:54] <wikibugs>	 Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (jijiki) >>! In T224491#5291244, @Jdforrester-WMF wrote: > This was a blocker for {T219150}, right?  Yeah...
[05:35:33] <icinga-wm>	 PROBLEM - HHVM rendering on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[05:36:49] <icinga-wm>	 RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 82296 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:06:54] <wikibugs>	 Operations, Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (ema) p:Triage→Normal
[06:09:45] <wikibugs>	 Operations, Continuous-Integration-Infrastructure (phase-out-jessie): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (MoritzMuehlenhoff)
[06:09:49] <wikibugs>	 Operations, serviceops, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This was resolved via T215975
[06:18:01] <wikibugs>	 (PS1) Muehlenhoff: Add José Pita to LDAP users table [puppet] - https://gerrit.wikimedia.org/r/519594 (https://phabricator.wikimedia.org/T226091)
[06:18:13] <wikibugs>	 Operations, Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (ema) I cannot reproduce the issue right now, but it does look like a strange interaction between the application servers and varnish.  >>! In T226776#5290831, @CD...
[06:18:58] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add José Pita to LDAP users table [puppet] - https://gerrit.wikimedia.org/r/519594 (https://phabricator.wikimedia.org/T226091) (owner: Muehlenhoff)
[06:22:02] <wikibugs>	 (PS1) Muehlenhoff: Add Camille de Nes to LDAP users table [puppet] - https://gerrit.wikimedia.org/r/519595
[06:24:47] <wikibugs>	 Operations, Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (TheDJ) >>! In T226776#5291441, @ema wrote: > Why did you conclude that?   the second after I refreshed after:  > <+wikibugs> (CR) Andrew Bogott: [C: +2] nova-full...
[06:24:57] <wikibugs>	 Operations, Traffic, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (ema) @Trizek-WMF: personally, I've t...
[06:31:31] <icinga-wm>	 PROBLEM - puppet last run on cp5008 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:32:55] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:33:17] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:33:39] <icinga-wm>	 PROBLEM - puppet last run on db2086 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:33:49] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:33:57] <wikibugs>	 Operations, Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (ema) >>! In T226776#5291445, @TheDJ wrote:  > the link started working again. Two things had changed at that time, the thing above and I had logged out and back i...
[06:52:14] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add Camille de Nes to LDAP users table [puppet] - https://gerrit.wikimedia.org/r/519595 (owner: Muehlenhoff)
[06:53:25] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:53:47] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:54:19] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:58:41] <icinga-wm>	 RECOVERY - puppet last run on cp5008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:29] <wikibugs>	 Operations, serviceops: cronspam for slow queries in PageAssessments - https://phabricator.wikimedia.org/T197564 (Joe) a:Joe→None
[07:00:53] <icinga-wm>	 RECOVERY - puppet last run on db2086 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:01:08] <wikibugs>	 (PS1) Ema: cache: reimage cp2005 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519597 (https://phabricator.wikimedia.org/T226637)
[07:04:44] <wikibugs>	 (CR) Vgutierrez: [C: +1] cache: reimage cp2005 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519597 (https://phabricator.wikimedia.org/T226637) (owner: Ema)
[07:05:48] <ema>	 !log depool cp2005 and reimage as upload_ats T226637
[07:05:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:54] <stashbot>	 T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[07:06:20] <wikibugs>	 (CR) Ema: [C: +2] cache: reimage cp2005 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519597 (https://phabricator.wikimedia.org/T226637) (owner: Ema)
[07:09:13] <wikibugs>	 Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2005.codfw.wmnet'] ` The log can be found in `...
[07:11:00] <_joe_>	 !log upgrading php-wikidiff2 on the mw canaries, only on php7 - T223391
[07:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:06] <stashbot>	 T223391: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391
[07:17:08] <wikibugs>	 Operations, ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti2009.codfw.wmnet', 'ganeti2010.codfw.wmnet', 'ganeti2011.codfw.wmnet', 'ga...
[07:17:48] <wikibugs>	 Operations, LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (MoritzMuehlenhoff) Hi Leszek, we have two ways to approach this: If you specifically only need Logstash access, we can extend the configuration f...
[07:18:10] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:23:20] <wikibugs>	 Operations, Traffic, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Pruem) @Trizek-WMF: It would help if...
[07:23:31] <wikibugs>	 Operations, WMDE-QWERTY-Team, serviceops, wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (Joe) a:jijiki→Joe I did rollout the new version on the canary servers today.  If I don't see higher error rates on mo...
[07:35:46] <wikibugs>	 (CR) Gergő Tisza: Add rate limiter to Special:ConfirmEmail - config change (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: SBassett)
[07:42:34] <icinga-wm>	 PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:49:00] <wikibugs>	 Operations, Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2005.codfw.wmnet'] `  and were **ALL** successful.
[07:57:52] <wikibugs>	 Operations, Continuous-Integration-Infrastructure (phase-out-jessie): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (hashar)
[07:57:54] <ema>	 !log pool cp2005 w/ ATS backend T226637
[07:57:56] <wikibugs>	 Operations, serviceops, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (hashar) Resolved→Open I need the package for **Stretch**!
[07:57:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:59] <stashbot>	 T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[08:00:07] <wikibugs>	 Operations, Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (Aklapper) For anyone affected by log-in problems who wants to help track them down:  Please see and follow https://www.mediawiki.org/wiki/Manual:How_to_debug/Login_problems and report back here. T...
[08:01:39] <wikibugs>	 Operations, serviceops, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (MoritzMuehlenhoff) And that is what T215975 provides...
[08:02:08] <moritzm>	 !log updating openssl packages on mw1265
[08:02:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:33] <wikibugs>	 Operations, ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2014.codfw.wmnet', 'ganeti2009.codfw.wmnet', 'ganeti2013.codfw.wmnet', 'ganeti2012.codfw.wmnet', 'ganeti2011.codfw.wmnet', 'gan...
[08:25:41] <wikibugs>	 (PS1) Elukey: profile::hue: add more specific alarms for kerberos [puppet] - https://gerrit.wikimedia.org/r/519601 (https://phabricator.wikimedia.org/T226698)
[08:28:07] <wikibugs>	 (CR) Elukey: [C: +2] profile::hue: add more specific alarms for kerberos [puppet] - https://gerrit.wikimedia.org/r/519601 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[08:29:19] <wikibugs>	 Operations, Traffic: nginx HTTP 500 rate increase on specific cache hosts - https://phabricator.wikimedia.org/T226805 (ema)
[08:29:32] <wikibugs>	 Operations, Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (ema) >>! In T226776#5291447, @ema wrote: > Well, the error response was surely generated by the applayer and not by varnish (the latter only generates synthetic r...
[08:30:24] <wikibugs>	 Operations, Traffic: nginx HTTP 500 rate increase on specific cache hosts - https://phabricator.wikimedia.org/T226805 (ema) p:Triage→Normal
[08:42:21] <wikibugs>	 Operations, serviceops, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (hashar) Ah eventually I found the entry:  ` Name: thirdparty/kubeadm-k8s-docker.com Method: https://download.docker.com/...
[08:43:04] <elukey>	 !log roll restart of eventstreams on all scb2* nodes, service now working (kafka transport failures logged)
[08:43:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Serve JPG when WEBP conversion fails [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707) (owner: Gilles)
[08:48:19] <ema>	 aaa~.
[08:48:40] <ema>	 nice
[08:48:57] <wikibugs>	 (CR) Filippo Giunchedi: "There seems to be consensus to use the Prometheus based dashboard, should we be replacing this one with it instead of removal? AIUI the nam" [puppet] - https://gerrit.wikimedia.org/r/519410 (https://phabricator.wikimedia.org/T184942) (owner: Cwhite)
[08:57:23] <wikibugs>	 (PS1) Filippo Giunchedi: logstash: add consumer for client errors [puppet] - https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142)
[09:05:19] <wikibugs>	 (PS1) Ema: cache: reimage cp2008 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519604 (https://phabricator.wikimedia.org/T226637)
[09:06:15] <wikibugs>	 Operations, Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (elukey) p:Triage→Normal
[09:07:51] <wikibugs>	 Operations, Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (elukey)
[09:09:44] <wikibugs>	 (CR) Vgutierrez: [C: +1] cache: reimage cp2008 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519604 (https://phabricator.wikimedia.org/T226637) (owner: Ema)
[09:10:29] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:10:31] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:10:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:49] <moritzm>	 !log rebooting releases* hosts for MDS-enabled qemu/kernel
[09:10:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:19] <wikibugs>	 (PS2) Filippo Giunchedi: logstash: add consumer for client errors [puppet] - https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142)
[09:15:58] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:16:00] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:16:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:45] <elukey>	 !log systemctl reset-failed kafka* units on kafka2002 (role spare, failed units, already masked)
[09:16:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:35] <ema>	 !log depool cp2008 and reimage as upload_ats T226637
[09:17:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:40] <stashbot>	 T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[09:18:12] <wikibugs>	 (CR) Ema: [C: +2] cache: reimage cp2008 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519604 (https://phabricator.wikimedia.org/T226637) (owner: Ema)
[09:21:44] <wikibugs>	 Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2008.codfw.wmnet'] ` The log can be found in `...
[09:27:24] <wikibugs>	 (PS1) Ema: vcl: remove Vary:AL workaround for fixcopyright.wm.org [puppet] - https://gerrit.wikimedia.org/r/519606 (https://phabricator.wikimedia.org/T203179)
[09:28:51] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:28:53] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:28:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:25] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[09:31:08] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[09:35:15] <wikibugs>	 Operations, LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (WMDE-leszek) >>! In T225004#5291475, @MoritzMuehlenhoff wrote: > Hi Leszek, > we have two ways to approach this: If you specifically only need Lo...
[09:41:05] <wikibugs>	 Operations, LDAP-Access-Requests: Grant WMDE engineers access to logstash and creating grafana boards  / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (WMDE-leszek)
[09:41:53] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "Overall good in premise, comments inline" (14 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: Jeena Huneidi)
[09:45:26] <wikibugs>	 Operations, vm-requests: Site: eqiad/codfw 2 VMs each for pool counters - https://phabricator.wikimedia.org/T226811 (MoritzMuehlenhoff)
[09:51:50] <wikibugs>	 (PS1) Elukey: Add missing kerberos config to Hadoop HDFS (test cluster) [puppet] - https://gerrit.wikimedia.org/r/519607 (https://phabricator.wikimedia.org/T226698)
[09:54:43] <wikibugs>	 (CR) Elukey: [C: +2] Add missing kerberos config to Hadoop HDFS (test cluster) [puppet] - https://gerrit.wikimedia.org/r/519607 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[09:56:37] <wikibugs>	 (CR) Gehel: [C: -1] "See comments inline." (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/506954 (https://phabricator.wikimedia.org/T221212) (owner: Volans)
[09:56:56] <wikibugs>	 Operations, ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti2009.codfw.wmnet', 'ganeti2010.codfw.wmnet', 'ganeti2011.codfw.wmnet', 'ga...
[09:59:10] <icinga-wm>	 PROBLEM - Host ganeti2010 is DOWN: PING CRITICAL - Packet loss = 100%
[09:59:23] <wikibugs>	 Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2008.codfw.wmnet'] `  and were **ALL** successful.
[10:03:05] <icinga-wm>	 RECOVERY - Host ganeti2010 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[10:06:15] <icinga-wm>	 PROBLEM - dhclient process on ganeti2010 is CRITICAL: connect to address 10.192.32.139 port 5666: Connection refused
[10:06:20] <icinga-wm>	 PROBLEM - puppet last run on ganeti2010 is CRITICAL: connect to address 10.192.32.139 port 5666: Connection refused
[10:06:23] <icinga-wm>	 PROBLEM - Check systemd state on ganeti2010 is CRITICAL: connect to address 10.192.32.139 port 5666: Connection refused
[10:06:44] <ema>	 akosiaris: ^
[10:06:45] <icinga-wm>	 PROBLEM - configured eth on ganeti2010 is CRITICAL: connect to address 10.192.32.139 port 5666: Connection refused
[10:06:55] <icinga-wm>	 PROBLEM - MD RAID on ganeti2010 is CRITICAL: connect to address 10.192.32.139 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:07:02] <ema>	 is that related to your reimage?
[10:07:24] <akosiaris>	 ema: yup. It's the only host that was actually successfully imaged in the previous run
[10:08:39] <wikibugs>	 Operations, netops, User-fgiunchedi: Add centrallog1001 to syslog servers in network ACLs - https://phabricator.wikimedia.org/T226813 (fgiunchedi)
[10:08:48] <ema>	 akosiaris: ok, anything to do or can I ack the alerts?
[10:09:09] <akosiaris>	 I 'll ack the alerts
[10:09:13] <ema>	 thanks!
[10:09:21] <akosiaris>	 hosts are not in service btw, completely new hosts
[10:09:28] <ema>	 !log pool cp2008 w/ ATS backend T226637
[10:09:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:33] <stashbot>	 T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[10:10:36] <wikibugs>	 (PS1) Elukey: profile::hadoop::backup::namenode: move crons to timers [puppet] - https://gerrit.wikimedia.org/r/519610 (https://phabricator.wikimedia.org/T226698)
[10:10:41] <icinga-wm>	 RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:11:19] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] cookbook API: add class API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans)
[10:13:15] <wikibugs>	 (03Abandoned) 10DCausse: Add a new extension point SshExecuteCommandInterceptor [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/502764 (owner: 10DCausse)
[10:14:18] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/17163/" [puppet] - 10https://gerrit.wikimedia.org/r/519610 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[10:15:20] <icinga-wm>	 RECOVERY - MD RAID on ganeti2010 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:15:34] <icinga-wm>	 RECOVERY - dhclient process on ganeti2010 is OK: PROCS OK: 0 processes with command name dhclient
[10:15:46] <icinga-wm>	 RECOVERY - Check systemd state on ganeti2010 is OK: OK - running: The system is fully operational
[10:17:57] <icinga-wm>	 PROBLEM - Host ganeti2009 is DOWN: PING CRITICAL - Packet loss = 100%
[10:19:33] <icinga-wm>	 RECOVERY - configured eth on ganeti2010 is OK: OK - interfaces up
[10:19:53] <icinga-wm>	 RECOVERY - Host ganeti2009 is UP: PING OK - Packet loss = 0%, RTA = 36.30 ms
[10:20:06] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/TimedMediaHandler/maintenance/requeueTranscodes.php: Extra filtering option (duration: 00m 51s)
[10:20:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:01] <icinga-wm>	 RECOVERY - puppet last run on ganeti2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:25:53] <wikibugs>	 (03PS3) 10Volans: cookbook API: add class API [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212)
[10:26:58] <wikibugs>	 (03PS1) 10Elukey: profile::hadoop::balancer: remove unused kerberos wrapper [puppet] - 10https://gerrit.wikimedia.org/r/519612 (https://phabricator.wikimedia.org/T226698)
[10:27:45] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::hadoop::balancer: remove unused kerberos wrapper [puppet] - 10https://gerrit.wikimedia.org/r/519612 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey)
[10:29:09] <wikibugs>	 10Operations, 10observability, 10serviceops: Gather metrics on request status codes, latencies from the MediaWiki appservers - https://phabricator.wikimedia.org/T226815 (10Joe)
[10:31:36] <Reedy>	 !log running `foreachwiki extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --audio --mime=audio/midi --missing --throttle` on mwmaint1002 in screen T226713
[10:31:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:44] <stashbot>	 T226713: Run cleanupTranscodes.php for current midi files - https://phabricator.wikimedia.org/T226713
[10:36:08] <wikibugs>	 10Operations, 10observability, 10serviceops: Gather metrics on request status codes, latencies from the MediaWiki appservers - https://phabricator.wikimedia.org/T226815 (10Joe) One relatively easy way to go could be to use mtail, which we use for quite some other things too.   There is even an [[https://gith...
[10:36:29] <wikibugs>	 10Operations, 10observability, 10serviceops: Gather metrics on request status codes, latencies from the MediaWiki appservers - https://phabricator.wikimedia.org/T226815 (10Joe)
[10:36:51] <wikibugs>	 (03PS4) 10Volans: cookbook API: add class API [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212)
[10:37:00] <wikibugs>	 10Operations, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Gerrit manager rights for Ottomata - https://phabricator.wikimedia.org/T226724 (10jbond) p:05Triage→03Normal
[10:37:31] <icinga-wm>	 PROBLEM - puppet last run on ganeti2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[drbd8-utils]
[10:37:46] <wikibugs>	 (03CR) 10Volans: "I finally had a bit of time to resume this work. See inline comments for the various bits." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans)
[10:39:15] <icinga-wm>	 PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:39:38] <wikibugs>	 10Operations, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Gerrit manager rights for Ottomata - https://phabricator.wikimedia.org/T226724 (10jbond) @Legoktm @hashar are either of you able to help with this?
[10:39:40] <wikibugs>	 (03PS1) 10Ema: cache: reimage cp2011 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519613 (https://phabricator.wikimedia.org/T226637)
[10:40:34] <wikibugs>	 10Operations, 10observability, 10serviceops: Gather metrics on request status codes, latencies from the MediaWiki appservers - https://phabricator.wikimedia.org/T226815 (10jbond) p:05Triage→03Normal
[10:41:14] <wikibugs>	 10Operations, 10netops, 10User-fgiunchedi: Add centrallog1001 to syslog servers in network ACLs - https://phabricator.wikimedia.org/T226813 (10jbond) p:05Triage→03Normal
[10:41:22] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] cache: reimage cp2011 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519613 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema)
[10:42:17] <wikibugs>	 10Operations, 10vm-requests: Site: eqiad/codfw 2 VMs each for pool counters - https://phabricator.wikimedia.org/T226811 (10jbond) p:05Triage→03Normal
[10:44:02] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10jbond) p:05Triage→03High
[10:45:41] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10jbond) p:05Triage→03Normal
[10:48:00] <wikibugs>	 (03CR) 10Volans: "Some replies to the comments inline" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/506954 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans)
[10:48:29] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10hashar) Update: it needs more discussion among SRE team :-]
[11:00:46] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Reedy) Does it persist after clearing cookies for the wiki domain? No existing cookies in incognito  When logging in in incognito mode? Doesn't work either When logging in with a different kind of...
[11:01:38] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Reedy) Probably needs a HAR file at this point
[11:03:08] <wikibugs>	 (03PS1) 10Ayounsi: DNS for netflow1001 [dns] - 10https://gerrit.wikimedia.org/r/519616 (https://phabricator.wikimedia.org/T226810)
[11:04:24] <_joe_>	 !log uploading php-wmerrors to thirdparty/php72 - T187147
[11:04:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:30] <stashbot>	 T187147: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147
[11:07:32] <wikibugs>	 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Samwalton9) https://stream.wikimedia.org/ seems to be down again?
[11:09:57] <fsero>	 !log draining kubernetes2005 for applying updates
[11:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:30] <fsero>	 !log draining kubernetes2006 for applying updates
[11:13:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:28] <fsero>	 !log draining kubernetes1005 for applying updates
[11:14:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:58] <fsero>	 !log draining kubernetes1006 for applying updates
[11:19:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:19] <wikibugs>	 (03PS2) 10Ayounsi: DNS for netflow1001 [dns] - 10https://gerrit.wikimedia.org/r/519616 (https://phabricator.wikimedia.org/T226810)
[11:21:21] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] DNS for netflow1001 [dns] - 10https://gerrit.wikimedia.org/r/519616 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi)
[11:33:46] <elukey>	 !log restart eventstreams on scb1001
[11:33:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:34] <wikibugs>	 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10elukey) @Samwalton9 We are currently working on it, there is indeed an issue with the eqiad backend hosts  of eventstreams :(
[11:36:45] <elukey>	 !log roll restart eventstreams on all scb1* nodes
[11:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:41] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Urbanecm) I can reproduce this when using mr.wikipedia.org in mr, however, when I try to login using https://mr.wikipedia.org/w/index.php?title=%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B7:%E0%A4...
[11:42:49] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Reedy) >>! In T226754#5292040, @Urbanecm wrote: > I can reproduce this when using mr.wikipedia.org in mr, however, when I try to login with [interface switched into English](https://mr.wikipedia.o...
[11:45:13] <wikibugs>	 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10elukey) Service should be restored now @Samwalton9
[11:46:46] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Urbanecm) Okay, I think this is caused by local content of mediawiki:Loginprompt. When I copied loginprompt from mrwiki to test.wikipedia.org, I reproduced the issue there as well. When I reverted...
[11:49:05] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Reedy) >>! In T226754#5292057, @Urbanecm wrote: > Okay, I think this is caused by local content of mediawiki:Loginprompt. When I copied loginprompt from mrwiki to test.wikipedia.org, I reproduced...
[11:49:52] <wikibugs>	 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10elukey) We got the following alarm at 10:42 UTC (why only analytics? Shouldn't this be owned by SRE?)  ` 10:42  <icinga-wm> PROBLEM - Check if active Eve...
[11:49:59] <wikibugs>	 10Operations, 10Analytics, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10elukey)
[11:59:18] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown: Impossible to log into mrwiki due to broken local "MediaWiki:Loginprompt" page - https://phabricator.wikimedia.org/T226754 (10Aklapper)
[12:02:39] <wikibugs>	 10Operations, 10Traffic, 10CommRel-Specialists-Support (Apr-Jun-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Vort) Problem was reproduced by me j...
[12:03:51] <wikibugs>	 (03PS1) 10Ayounsi: Netboot/DHCP for netflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/519620 (https://phabricator.wikimedia.org/T226810)
[12:05:37] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Netboot/DHCP for netflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/519620 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi)
[12:16:03] <wikibugs>	 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2012.codfw.wmnet', 'ganeti2011.codfw.wmnet', 'ganeti2009.codfw.wmnet'] `  Of which those **FAILED**: ` ['ganeti2012.codfw.wmnet...
[12:17:43] <thedj>	 so i JUST had that very slow character by character page download again from esams
[12:18:42] <wikibugs>	 10Operations, 10Discovery-Search, 10hardware-requests: Replace elastic1017-1031 - https://phabricator.wikimedia.org/T221636 (10Gehel) a:05Gehel→03RobH This is scheduled to be done in Q1, so we can get started. As a reminder, some preliminary estimate were done in T222104 (not sure what can / should be re...
[12:19:49] <thedj>	 ema: i have the page open, but the inspector can't check the headers of the request, as i opened the inspector after opening the page. anything else i can check to find a cause?
[12:19:49] <wikibugs>	 10Operations, 10Discovery-Search, 10hardware-requests: Replace elastic1017-1031 - https://phabricator.wikimedia.org/T221636 (10Gehel)
[12:20:19] <wikibugs>	 10Operations, 10hardware-requests, 10Discovery-Search (Current work): Replace elastic1017-1031 - https://phabricator.wikimedia.org/T221636 (10Gehel)
[12:20:35] <wikibugs>	 10Operations, 10Traffic, 10CommRel-Specialists-Support (Apr-Jun-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10TheDJ) Just happened to me again as...
[12:28:07] <icinga-wm>	 RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 102.1 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[12:35:13] <wikibugs>	 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) Classification successfully deployed in ulsfo/codfw/eqdfw/eqord (half-ish of our POPs), will push to the other sites early next week.  Then start dropping invalids on IXPs to see the effect it has in term of traffic...
[12:36:39] <godog>	 tgr: I'm trying to trigger errors on demand to test T217142, pasting the code you have at https://en.wikipedia.beta.wmflabs.org/wiki/User:Tgr/common.js in my browser's console should be enough?
[12:36:40] <stashbot>	 T217142: [WIP] [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors - https://phabricator.wikimedia.org/T217142
[12:43:48] <wikibugs>	 10Operations, 10observability: consider running bastion Prometheis inside cgroups - https://phabricator.wikimedia.org/T226769 (10fgiunchedi) Agreed, memory limiting in the interim while Ganeti is being setup sounds good to me.
[12:45:48] <wikibugs>	 10Operations, 10Analytics, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Samwalton9) Looks good to me, thanks!
[12:51:00] <tgr>	 godog: yeah, it adds an error generator link on the left sidebar
[12:51:43] <godog>	 tgr: awesome, thank you!
[12:53:39] <godog>	 tgr: I'm clicking the generated and I see the uncaught error in the console, but I'm not seeing the network request to eventgate-logging in the network tab, it should be there I assume?
[12:53:51] <godog>	 the generated link even
[12:54:48] <tgr>	 yeah, it should
[12:55:31] <tgr>	 works for me
[12:55:35] <godog>	 mhh no still not seeing it, do I need to be logged in perhaps?
[12:56:02] <tgr>	 if you are not logged in how did you add the script in the first place?
[12:56:40] <tgr>	 oh, you were using the JS console?
[12:56:49] <godog>	 yeah
[12:57:05] <tgr>	 that won't work, the script triggering the error has to be same-origin
[12:57:11] <wikibugs>	 10Operations, 10Traffic, 10CommRel-Specialists-Support (Apr-Jun-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Vort) Here is a screenshot of laggy...
[12:57:39] <tgr>	 otherwise the browser will sanitize away all useful details and Raven will ignore it
[13:00:23] <godog>	 ack, got it thank you
[13:00:41] <godog>	 now trying to create an account on beta I'm getting The database has been automatically locked while the replica database servers catch up to the master. -.-
[13:01:09] <godog>	 from Special:CreateAccount that is
[13:02:55] <godog>	 I'll ask on -releng
[13:03:49] <tgr>	 that seems genuine
[13:08:22] <godog>	 aye, works now including generating the errors
[13:10:26] <wikibugs>	 (03PS1) 10Ayounsi: Apply role netinsights to netflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/519626 (https://phabricator.wikimedia.org/T226810)
[13:13:28] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Apply role netinsights to netflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/519626 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi)
[13:15:19] <wikibugs>	 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10awight) @jijiki We're a bit confused because the beta cluster [[ https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version |...
[13:15:25] <godog>	 tgr|away: ok now errors show up in logstash too, https://logstash-beta.wmflabs.org/goto/3aa34cf042b6149fe7126cba08df1143
[13:15:34] <godog>	 I'll update the puppet patch
[13:20:59] <wikibugs>	 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10Joe) >>! In T223391#5292192, @awight wrote: > @jijiki We're a bit confused because the beta cluster [[ https://en.wikipedia.b...
[13:22:18] <wikibugs>	 (03PS3) 10Filippo Giunchedi: logstash: add consumer for client errors [puppet] - 10https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142)
[13:22:32] <wikibugs>	 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10awight) >>! In T223391#5292196, @Joe wrote: >>>! In T223391#5292192, @awight wrote: >> @jijiki We're a bit confused because t...
[13:24:55] <wikibugs>	 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10Joe) Yes :) It's by far the best option.
[13:25:04] <_joe_>	 awight: yeah, sorry, brainfart
[13:25:22] <wikibugs>	 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10awight) Certainly seems to be the case.  Good news, with PHP7 enabled I get the expected results where the new version of wik...
[13:25:43] <awight>	 _joe_: I didn't smell a thing.
[13:31:27] <wikibugs>	 10Operations, 10Analytics, 10EventBus, 10Core Platform Team Backlog (Watching / External), and 2 others: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (10Ottomata) Hm @herron, today we experienced {T226808}, which I think is...
[13:35:20] <tgr>	 awesome!
[13:40:36] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: Add a fatal error page to go with the proposed wmerrors feature [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling)
[13:46:18] <ema>	 thedj: hey! Can you load the page with inspector enabled and tell me the value of the X-Cache response header?
[13:49:04] <thedj>	 ema: right now that page returns: cp1089 pass, cp3043 pass, cp3030 pass
[13:49:37] <thedj>	 and a refresh later its cp1083 pass, cp3033 pass, cp3030 pass
[13:49:49] <thedj>	 but these all work instantly
[13:50:55] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Qgil) Hi @revi, good point. I *think* Discourse doesn't do this out of the box, but we can investigate (and probably discuss in a task apart). Maybe a plu...
[13:51:47] <ema>	 thedj: ok perfect. The interesting value for me right now is cp3030, the frontend. See https://wikitech.wikimedia.org/wiki/Varnish#X-Cache for an explanation if you're curious :)
[13:53:01] <wikibugs>	 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) The queries for `varnishstatsd` metrics I've been able to find during the audit:  ` (varnish.$dc.backends.be_*api_svc*.GET.sample_rate, 60...
[13:54:39] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10revi) I don't know the exact reason why it is not archived, but I think most plausible reason is, in mailman, it is virtually tooooooo hard to erase stuff...
[14:01:02] <icinga-wm>	 PROBLEM - puppet last run on netflow1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[kafkatee]
[14:04:08] <icinga-wm>	 PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[14:04:17] <wikibugs>	 (03CR) 10CDanis: "this looks pretty good.  given the stack of changes here, I'm not asking for any changes now, just making notes for a followup pass once a" (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 (owner: 10Volans)
[14:06:22] <ema>	 !log depool cp2011 and reimage as upload_ats T226637
[14:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:28] <stashbot>	 T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[14:07:06] <wikibugs>	 (03CR) 10Ema: [C: 03+2] cache: reimage cp2011 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519613 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema)
[14:07:15] <wikibugs>	 (03PS2) 10Ema: cache: reimage cp2011 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519613 (https://phabricator.wikimedia.org/T226637)
[14:07:32] <logmsgbot>	 !log eevans@deploy1001 scap-helm sessionstore upgrade production -f sessionstore-staging-values.yaml stable/kask [namespace: sessionstore, clusters: staging]
[14:07:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:42] <wikibugs>	 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10jbond) The   >>! In T222356#5167750, @jbond wrote: > correct pull request https://github.com/puppetlabs/facter/pull/1775  This has now been merged and i have build a new package [avalibl...
[14:10:38] <wikibugs>	 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2011.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim...
[14:11:29] <logmsgbot>	 !log eevans@deploy1001 scap-helm sessionstore upgrade staging -f sessionstore-staging-values.yaml stable/kask [namespace: sessionstore, clusters: staging]
[14:11:30] <logmsgbot>	 !log eevans@deploy1001 scap-helm sessionstore cluster staging completed
[14:11:30] <logmsgbot>	 !log eevans@deploy1001 scap-helm sessionstore finished
[14:11:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:19] <wikibugs>	 (03CR) 10CDanis: "one question, one thing to fix afterwards" (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/519458 (owner: 10Volans)
[14:18:27] <wikibugs>	 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10Papaul)
[14:19:57] <icinga-wm>	 RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[14:29:03] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] configuration: change IRC default values [software/conftool] - 10https://gerrit.wikimedia.org/r/519459 (owner: 10Volans)
[14:29:09] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] dbconfig: allow to remote paste messages [software/conftool] - 10https://gerrit.wikimedia.org/r/519460 (owner: 10Volans)
[14:31:48] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: mediawiki::php: add a fatal error page to go with the proposed wmerrors feature [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling)
[14:33:07] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] dbconfig: improve config commit and restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519461 (owner: 10Volans)
[14:33:33] <wikibugs>	 (03PS2) 10Volans: dbconfig: structure return values of actions [software/conftool] - 10https://gerrit.wikimedia.org/r/519457
[14:33:35] <wikibugs>	 (03PS2) 10Volans: dbconfig: config diff, on non-empty diff exit 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/519458
[14:33:37] <wikibugs>	 (03PS2) 10Volans: configuration: change IRC default values [software/conftool] - 10https://gerrit.wikimedia.org/r/519459
[14:33:39] <wikibugs>	 (03PS2) 10Volans: dbconfig: allow to remotely paste messages [software/conftool] - 10https://gerrit.wikimedia.org/r/519460
[14:33:41] <wikibugs>	 (03PS2) 10Volans: dbconfig: improve config commit and restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519461
[14:34:11] <wikibugs>	 (03CR) 10Volans: "comments addressed" (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 (owner: 10Volans)
[14:34:36] <wikibugs>	 (03CR) 10Volans: "comment addressed, question answered" (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/519458 (owner: 10Volans)
[14:35:17] <wikibugs>	 (03CR) 10CDanis: dbconfig: structure return values of actions (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 (owner: 10Volans)
[14:37:01] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] dbconfig: config diff, on non-empty diff exit 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/519458 (owner: 10Volans)
[14:38:57] <wikibugs>	 (03CR) 10Volans: "missed done..." (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 (owner: 10Volans)
[14:41:29] <wikibugs>	 (03PS3) 10Volans: dbconfig: structure return values of actions [software/conftool] - 10https://gerrit.wikimedia.org/r/519457
[14:41:31] <wikibugs>	 (03PS3) 10Volans: dbconfig: config diff, on non-empty diff exit 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/519458
[14:41:33] <wikibugs>	 (03PS3) 10Volans: configuration: change IRC default values [software/conftool] - 10https://gerrit.wikimedia.org/r/519459
[14:41:35] <wikibugs>	 (03PS3) 10Volans: dbconfig: allow to remotely paste messages [software/conftool] - 10https://gerrit.wikimedia.org/r/519460
[14:41:37] <wikibugs>	 (03PS3) 10Volans: dbconfig: improve config commit and restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519461
[14:42:09] <wikibugs>	 (03PS1) 10Ladsgroup: labs: Enable jsonld for entity type for beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519646 (https://phabricator.wikimedia.org/T226472)
[14:42:20] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] dbconfig: structure return values of actions [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 (owner: 10Volans)
[14:43:25] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519646 (https://phabricator.wikimedia.org/T226472) (owner: 10Ladsgroup)
[14:44:21] <wikibugs>	 (03Merged) 10jenkins-bot: labs: Enable jsonld for entity type for beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519646 (https://phabricator.wikimedia.org/T226472) (owner: 10Ladsgroup)
[14:44:49] <Amir1>	 ^ rebased on deploy1001
[14:45:17] <wikibugs>	 (03Merged) 10jenkins-bot: dbconfig: structure return values of actions [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 (owner: 10Volans)
[14:45:19] <wikibugs>	 (03Merged) 10jenkins-bot: dbconfig: config diff, on non-empty diff exit 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/519458 (owner: 10Volans)
[14:45:21] <wikibugs>	 (03Merged) 10jenkins-bot: configuration: change IRC default values [software/conftool] - 10https://gerrit.wikimedia.org/r/519459 (owner: 10Volans)
[14:45:37] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Qgil) Well, yes, I was about to ask why.  :)  In Discourse entire topics (threads) or specific posts (messages) can be deleted by admins and moderators fr...
[14:45:52] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] dbconfig: allow to remotely paste messages [software/conftool] - 10https://gerrit.wikimedia.org/r/519460 (owner: 10Volans)
[14:46:22] <wikibugs>	 (03CR) 10jenkins-bot: labs: Enable jsonld for entity type for beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519646 (https://phabricator.wikimedia.org/T226472) (owner: 10Ladsgroup)
[14:47:25] <XioNoX>	 !log upload kafkatee to buster-wikimedia
[14:47:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:27] <wikibugs>	 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2011.codfw.wmnet'] `  and were **ALL** successful.
[14:48:29] <wikibugs>	 (03Merged) 10jenkins-bot: dbconfig: allow to remotely paste messages [software/conftool] - 10https://gerrit.wikimedia.org/r/519460 (owner: 10Volans)
[14:48:31] <wikibugs>	 (03Merged) 10jenkins-bot: dbconfig: improve config commit and restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519461 (owner: 10Volans)
[14:48:40] <ema>	 !log pool cp2011 w/ ATS backend T226637
[14:48:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:45] <stashbot>	 T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[14:49:02] <wikibugs>	 Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (tramm) >>! In T204056#5285817, @jcrespo wrote: > This is blocked on @CRoslof or someone else from legal.  Is there any more information we can provide on the i...
[14:50:07] <icinga-wm>	 RECOVERY - puppet last run on netflow1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[14:51:11] <icinga-wm>	 PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:54:16] <wikibugs>	 (CR) CDanis: "Hashar, are you still looking for reviews on these Swift changes?" [puppet] - https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: Hashar)
[14:59:13] <icinga-wm>	 PROBLEM - Host ganeti2014 is DOWN: PING CRITICAL - Packet loss = 100%
[14:59:39] <icinga-wm>	 PROBLEM - Host ganeti2009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:59:51] <icinga-wm>	 PROBLEM - Host ganeti2010 is DOWN: PING CRITICAL - Packet loss = 100%
[14:59:59] <icinga-wm>	 PROBLEM - Host ganeti2013 is DOWN: PING CRITICAL - Packet loss = 100%
[15:00:05] <icinga-wm>	 PROBLEM - Host ganeti2012 is DOWN: PING CRITICAL - Packet loss = 100%
[15:01:55] <akosiaris>	 ignore those ^
[15:02:00] <akosiaris>	 icinga was faster than me
[15:02:13] <icinga-wm>	 RECOVERY - Host ganeti2010 is UP: PING WARNING - Packet loss = 64%, RTA = 36.22 ms
[15:02:15] <logmsgbot>	 !log eevans@deploy1001 scap-helm sessionstore upgrade production -f sessionstore-eqiad-values.yaml stable/kask [namespace: sessionstore, clusters: eqiad]
[15:02:15] <icinga-wm>	 RECOVERY - Host ganeti2014 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms
[15:02:15] <logmsgbot>	 !log eevans@deploy1001 scap-helm sessionstore cluster eqiad completed
[15:02:16] <logmsgbot>	 !log eevans@deploy1001 scap-helm sessionstore finished
[15:02:19] <icinga-wm>	 RECOVERY - Host ganeti2009 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms
[15:02:19] <icinga-wm>	 RECOVERY - Host ganeti2013 is UP: PING OK - Packet loss = 0%, RTA = 36.31 ms
[15:02:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:31] <icinga-wm>	 RECOVERY - Host ganeti2012 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[15:02:31] <icinga-wm>	 PROBLEM - Host ganeti2011 is DOWN: PING CRITICAL - Packet loss = 100%
[15:02:31] <icinga-wm>	 PROBLEM - Host ganeti2015 is DOWN: PING CRITICAL - Packet loss = 100%
[15:02:31] <icinga-wm>	 PROBLEM - Host ganeti2016 is DOWN: PING CRITICAL - Packet loss = 100%
[15:02:31] <icinga-wm>	 PROBLEM - Host ganeti2017 is DOWN: PING CRITICAL - Packet loss = 100%
[15:02:59] <icinga-wm>	 RECOVERY - Host ganeti2011 is UP: PING WARNING - Packet loss = 64%, RTA = 36.24 ms
[15:04:41] <icinga-wm>	 RECOVERY - Host ganeti2015 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
[15:04:41] <icinga-wm>	 RECOVERY - Host ganeti2016 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms
[15:04:41] <icinga-wm>	 RECOVERY - Host ganeti2017 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[15:04:43] <logmsgbot>	 !log eevans@deploy1001 scap-helm sessionstore upgrade production -f sessionstore-codfw-values.yaml stable/kask [namespace: sessionstore, clusters: codfw]
[15:04:44] <logmsgbot>	 !log eevans@deploy1001 scap-helm sessionstore cluster codfw completed
[15:04:44] <logmsgbot>	 !log eevans@deploy1001 scap-helm sessionstore finished
[15:04:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:51] <wikibugs>	 Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Nuria) The expiration for objects can be specified at the time of upload, so it needs to be added to our current wo...
[15:20:57] <wikibugs>	 Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (fgiunchedi) @EBernhardson @Ottomata re: swift expiring objects, see the link above too and tl;dr is:  The X-Delete-...
[15:22:24] <wikibugs>	 Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Ottomata) Great!  @fgiunchedi you said 'that is something we'd have to deploy first'.  Can I use this now?
[15:29:52] <wikibugs>	 (PS1) Andrew Bogott: fullstack monitoring: adjust behavior of the leak counter [puppet] - https://gerrit.wikimedia.org/r/519653 (https://phabricator.wikimedia.org/T226647)
[15:31:02] <wikibugs>	 (PS2) Andrew Bogott: fullstack monitoring: adjust behavior of the leak counter [puppet] - https://gerrit.wikimedia.org/r/519653 (https://phabricator.wikimedia.org/T226647)
[15:31:45] <wikibugs>	 (CR) Andrew Bogott: [C: +2] fullstack monitoring: adjust behavior of the leak counter [puppet] - https://gerrit.wikimedia.org/r/519653 (https://phabricator.wikimedia.org/T226647) (owner: Andrew Bogott)
[15:37:30] <wikibugs>	 Operations, ops-eqiad, Cassandra, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (Eevans) >>! In T222960#5289614, @Cmjohnson wrote: > @Eevans Do you still want to move this se...
[15:41:27] <wikibugs>	 (PS1) Jhedden: pdns: add rec_control profile to codfw1dev [puppet] - https://gerrit.wikimedia.org/r/519656 (https://phabricator.wikimedia.org/T224688)
[15:42:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] pdns: add rec_control profile to codfw1dev [puppet] - https://gerrit.wikimedia.org/r/519656 (https://phabricator.wikimedia.org/T224688) (owner: Jhedden)
[15:43:36] <wikibugs>	 (PS2) Jhedden: pdns: add rec_control profile to codfw1dev [puppet] - https://gerrit.wikimedia.org/r/519656 (https://phabricator.wikimedia.org/T224688)
[15:48:43] <wikibugs>	 Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (fgiunchedi) >>! In T213976#5292498, @Ottomata wrote: > Great!  @fgiunchedi you said 'that is something we'd have to...
[15:49:29] <wikibugs>	 Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Ottomata) Oh ok, will do!
[15:51:19] <wikibugs>	 (CR) Andrew Bogott: [C: +1] pdns: add rec_control profile to codfw1dev [puppet] - https://gerrit.wikimedia.org/r/519656 (https://phabricator.wikimedia.org/T224688) (owner: Jhedden)
[15:57:39] <wikibugs>	 (CR) Jhedden: [C: +2] pdns: add rec_control profile to codfw1dev [puppet] - https://gerrit.wikimedia.org/r/519656 (https://phabricator.wikimedia.org/T224688) (owner: Jhedden)
[16:01:43] <wikibugs>	 Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (mmodell) >>! In T213976#4968603, @Ottomata wrote: > @mmodell This is kind of a 'deployment' process thing, is this...
[16:01:51] <wikibugs>	 Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (mmodell) >>! In T213976#4995886, @Ladsgroup wrote: > Yes, you're right. Maybe turning mwmaint1002 to a minikube and...
[16:01:55] <wikibugs>	 Operations, MediaWiki-extensions-CentralAuth, Traffic, Wikimedia-production-error: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (Krinkle)
[16:02:22] <wikibugs>	 Operations, MediaWiki-extensions-CentralAuth, Traffic, Performance-Team (Radar), and 2 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (Krinkle)
[16:04:10] <wikibugs>	 Operations, MediaWiki-extensions-CentralAuth, Traffic, Performance-Team (Radar), and 2 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (CDanis) NB that the default limit in Varnish actually...
[16:11:17] <wikibugs>	 (PS1) BBlack: varnish: temporarily allow more response headers [puppet] - https://gerrit.wikimedia.org/r/519661 (https://phabricator.wikimedia.org/T226840)
[16:11:54] <wikibugs>	 (PS1) Cwhite: grafana: fix and update to grafana-dashboard script [puppet] - https://gerrit.wikimedia.org/r/519662
[16:12:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] grafana: fix and update to grafana-dashboard script [puppet] - https://gerrit.wikimedia.org/r/519662 (owner: Cwhite)
[16:15:38] <wikibugs>	 (PS1) Cwhite: grafana: update varnish-aggregate-client-status-codes to prometheus version [puppet] - https://gerrit.wikimedia.org/r/519664 (https://phabricator.wikimedia.org/T184942)
[16:16:11] <wikibugs>	 (Abandoned) Cwhite: grafana: remove legacy varnish-aggregate-client-status-codes [puppet] - https://gerrit.wikimedia.org/r/519410 (https://phabricator.wikimedia.org/T184942) (owner: Cwhite)
[16:19:13] <wikibugs>	 (PS2) Cwhite: grafana: fix and update to grafana-dashboard script [puppet] - https://gerrit.wikimedia.org/r/519662
[16:32:31] <wikibugs>	 (CR) BBlack: [C: +2] varnish: temporarily allow more response headers [puppet] - https://gerrit.wikimedia.org/r/519661 (https://phabricator.wikimedia.org/T226840) (owner: BBlack)
[16:35:10] <bblack>	 !log Raising varnish max_http_hdr (max allowed applayer response header count) from 64->128 in systemd config and live tuning - https://gerrit.wikimedia.org/r/519661 - T226840
[16:35:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:16] <stashbot>	 T226840: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840
[16:36:17] <wikibugs>	 (PS2) Isaac Johnson: Undeploy reader demographics surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/519216 (https://phabricator.wikimedia.org/T226273) (owner: Bmansurov)
[16:38:10] <wikibugs>	 (CR) Nuria: ReportUpdater: change repo of all queries to reportupdater-queries (1 comment) [puppet] - https://gerrit.wikimedia.org/r/517085 (https://phabricator.wikimedia.org/T221064) (owner: Fdans)
[16:39:52] <wikibugs>	 Operations, Analytics, Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (Ottomata) EventStreams is hitting its concurrent connection limits of about 200 connections.  We think this is probably due to a single cl...
[16:41:37] <wikibugs>	 (PS2) Fsero: introducing helmfile.d values for staging cluster [deployment-charts] - https://gerrit.wikimedia.org/r/517887 (https://phabricator.wikimedia.org/T212130)
[16:48:21] <wikibugs>	 (PS1) DLynch: Set some wikis to use the mobile-ve-as-default a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/519667 (https://phabricator.wikimedia.org/T221196)
[16:51:27] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Reset PW for Code-Health mailing list - https://phabricator.wikimedia.org/T226842 (Jrbranaa)
[17:02:03] <icinga-wm>	 PROBLEM - puppet last run on ganeti2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Package[drbd8-utils]
[17:02:19] <chaomodus>	 ah
[17:05:03] <wikibugs>	 (PS1) BBlack: Increase nginx limits on http resp hdr block size [puppet] - https://gerrit.wikimedia.org/r/519670 (https://phabricator.wikimedia.org/T226840)
[17:05:43] <wikibugs>	 (CR) SBassett: Add rate limiter to Special:ConfirmEmail - config change (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: SBassett)
[17:06:30] <wikibugs>	 (CR) BBlack: [C: +2] Increase nginx limits on http resp hdr block size [puppet] - https://gerrit.wikimedia.org/r/519670 (https://phabricator.wikimedia.org/T226840) (owner: BBlack)
[17:11:31] <icinga-wm>	 PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:11:35] <icinga-wm>	 PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:11:47] <icinga-wm>	 PROBLEM - puppet last run on mw1331 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:11:52] <wikibugs>	 Operations, hardware-requests, Discovery-Search (Current work): Replace elastic1017-1031 - https://phabricator.wikimedia.org/T221636 (RobH)
[17:11:55] <wikibugs>	 Operations, Analytics, Analytics-Kanban, vm-requests, User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (elukey)
[17:12:15] <icinga-wm>	 PROBLEM - puppet last run on cp1087 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:12:23] <icinga-wm>	 PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:12:23] <icinga-wm>	 PROBLEM - puppet last run on mw2189 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:12:47] <icinga-wm>	 PROBLEM - puppet last run on ms-fe1007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:13:05] <icinga-wm>	 PROBLEM - puppet last run on elastic1052 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:13:07] <icinga-wm>	 PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:13:13] <icinga-wm>	 PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:13:21] <logmsgbot>	 !log joal@deploy1001 Started deploy [analytics/refinery@8d6fa30]: Laste regular analytics weekly deploy
[17:13:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:33] <icinga-wm>	 PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:13:33] <icinga-wm>	 PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:13:37] <wikibugs>	 (PS1) BBlack: Revert "Increase nginx limits on http resp hdr block size" [puppet] - https://gerrit.wikimedia.org/r/519672
[17:13:49] <wikibugs>	 (CR) BBlack: [V: +2 C: +2] Revert "Increase nginx limits on http resp hdr block size" [puppet] - https://gerrit.wikimedia.org/r/519672 (owner: BBlack)
[17:13:51] <icinga-wm>	 PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:13:51] <icinga-wm>	 PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:13:51] <wikibugs>	 (CR) Jeena Huneidi: Give scaffold template configuration options for dev purposes (10 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: Jeena Huneidi)
[17:13:57] <icinga-wm>	 PROBLEM - puppet last run on mw2275 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:13:59] <icinga-wm>	 PROBLEM - puppet last run on mw2155 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:13:59] <icinga-wm>	 PROBLEM - puppet last run on mw2178 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:14:03] <icinga-wm>	 PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:14:33] <mutante>	 known and not breaking prod ^
[17:14:38] <bblack>	 nginx-reload issues are from the patch I'm reverting.  either way it's noise and non-affecting of prod traffic
[17:14:47] <icinga-wm>	 PROBLEM - puppet last run on elastic2026 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:14:49] <icinga-wm>	 PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:14:53] <icinga-wm>	 PROBLEM - puppet last run on cp1078 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:15:01] <icinga-wm>	 PROBLEM - puppet last run on ms-fe1005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:15:01] <icinga-wm>	 PROBLEM - puppet last run on mw1321 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:15:17] <icinga-wm>	 PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:15:27] <icinga-wm>	 PROBLEM - puppet last run on mw1320 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:15:31] <icinga-wm>	 PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:15:33] <icinga-wm>	 PROBLEM - puppet last run on cp1081 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:15:35] <icinga-wm>	 PROBLEM - puppet last run on cp5007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:15:41] <icinga-wm>	 PROBLEM - puppet last run on mw1225 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:15:47] <icinga-wm>	 PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:15:51] <icinga-wm>	 PROBLEM - puppet last run on mw2273 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:15:57] <icinga-wm>	 PROBLEM - puppet last run on cp4026 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:03] <icinga-wm>	 PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:21] <icinga-wm>	 PROBLEM - puppet last run on elastic2034 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:27] <icinga-wm>	 PROBLEM - puppet last run on mw1269 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:27] <icinga-wm>	 PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:27] <icinga-wm>	 PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:27] <icinga-wm>	 PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:31] <icinga-wm>	 PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:33] <icinga-wm>	 PROBLEM - puppet last run on elastic1037 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:35] <icinga-wm>	 PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:35] <icinga-wm>	 PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:35] <icinga-wm>	 PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:37] <icinga-wm>	 PROBLEM - puppet last run on elastic2043 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:37] <icinga-wm>	 PROBLEM - puppet last run on mw1333 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:45] <icinga-wm>	 PROBLEM - puppet last run on mw2278 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:47] <wikibugs>	 Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Dzahn) Stalled→Open
[17:16:55] <icinga-wm>	 PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:16:55] <icinga-wm>	 PROBLEM - puppet last run on cp1079 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:07] <icinga-wm>	 PROBLEM - puppet last run on mw1313 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:07] <icinga-wm>	 PROBLEM - puppet last run on cp5004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:13] <icinga-wm>	 PROBLEM - puppet last run on mw2215 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:21] <icinga-wm>	 PROBLEM - puppet last run on mw2282 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:21] <icinga-wm>	 PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:23] <icinga-wm>	 PROBLEM - puppet last run on mw2249 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:35] <icinga-wm>	 PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:35] <icinga-wm>	 PROBLEM - puppet last run on mw2219 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:45] <icinga-wm>	 PROBLEM - puppet last run on elastic2031 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:45] <icinga-wm>	 PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:55] <icinga-wm>	 PROBLEM - puppet last run on mw2269 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:55] <icinga-wm>	 PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:55] <icinga-wm>	 PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:57] <icinga-wm>	 PROBLEM - puppet last run on mw2254 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:17:57] <icinga-wm>	 PROBLEM - puppet last run on mw1315 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:01] <icinga-wm>	 PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:05] <icinga-wm>	 PROBLEM - puppet last run on elastic2027 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:09] <icinga-wm>	 PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:31] <icinga-wm>	 PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:37] <icinga-wm>	 PROBLEM - puppet last run on elastic2052 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:37] <icinga-wm>	 PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:37] <icinga-wm>	 PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:37] <icinga-wm>	 PROBLEM - puppet last run on cp4032 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:41] <icinga-wm>	 PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:47] <icinga-wm>	 PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:53] <icinga-wm>	 PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:55] <icinga-wm>	 PROBLEM - puppet last run on elastic1044 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:18:55] <icinga-wm>	 PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:01] <icinga-wm>	 PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:19] <icinga-wm>	 PROBLEM - puppet last run on mw1344 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:19] <icinga-wm>	 PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:19] <icinga-wm>	 PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:19] <icinga-wm>	 PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:21] <icinga-wm>	 PROBLEM - puppet last run on elastic2038 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:25] <icinga-wm>	 PROBLEM - puppet last run on mw2264 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:25] <icinga-wm>	 PROBLEM - puppet last run on mw2272 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:27] <icinga-wm>	 PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:27] <icinga-wm>	 PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:33] <icinga-wm>	 PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:33] <icinga-wm>	 PROBLEM - puppet last run on mw2159 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:33] <icinga-wm>	 PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:43] <icinga-wm>	 PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:43] <icinga-wm>	 PROBLEM - puppet last run on cp4024 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:19:49] <wikibugs>	 (03PS1) 10CDanis: Increase nginx limits on http resp hdr block size [puppet] - 10https://gerrit.wikimedia.org/r/519675
[17:19:53] <icinga-wm>	 PROBLEM - puppet last run on elastic2025 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:20:15] <icinga-wm>	 PROBLEM - puppet last run on cp4029 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:20:17] <icinga-wm>	 PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:20:21] <icinga-wm>	 PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:20:43] <icinga-wm>	 PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:20:43] <icinga-wm>	 PROBLEM - puppet last run on mw1330 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:20:47] <icinga-wm>	 PROBLEM - puppet last run on mw2200 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:21:07] <icinga-wm>	 PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:21:13] <icinga-wm>	 PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:21:19] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:21:23] <icinga-wm>	 PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:21:25] <icinga-wm>	 PROBLEM - puppet last run on elastic1049 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:22:30] <wikibugs>	 10Operations, 10Analytics, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Pchelolo)
[17:23:37] <icinga-wm>	 PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload]
[17:24:31] <icinga-wm>	 PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 4948 MB (3% inode=82%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[17:26:01] <wikibugs>	 (03PS2) 10CDanis: Increase nginx limits on http resp hdr block size [puppet] - 10https://gerrit.wikimedia.org/r/519675 (https://phabricator.wikimedia.org/T226840)
[17:26:53] <icinga-wm>	 PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:27:25] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:27:49] <icinga-wm>	 PROBLEM - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: /srv 4673 MB (3% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[17:28:36] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey)
[17:29:36] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey)
[17:31:19] <icinga-wm>	 RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:31:36] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey)
[17:31:49] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:32:07] <wikibugs>	 (03PS3) 10CDanis: Increase nginx limits on http resp hdr block size [puppet] - 10https://gerrit.wikimedia.org/r/519675 (https://phabricator.wikimedia.org/T226840)
[17:33:05] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:33:09] <icinga-wm>	 RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:36:32] <ottomata>	 !log restarting eventstreams on scb1001 with trace logging of X-Client-IP for T226808
[17:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:37] <stashbot>	 T226808: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808
[17:37:09] <ottomata>	 huh elukey did you know
[17:37:09] <ottomata>	 https://config-master.wikimedia.org/pybal/eqiad/eventstreams
[17:37:15] <ottomata>	 the scbs have different weights!?
[17:37:26] <wikibugs>	 (03PS4) 10CDanis: Increase nginx limits on http resp hdr block size [puppet] - 10https://gerrit.wikimedia.org/r/519675 (https://phabricator.wikimedia.org/T226840)
[17:37:54] <elukey>	 nope
[17:38:25] <icinga-wm>	 RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[17:38:29] <icinga-wm>	 RECOVERY - puppet last run on elastic2043 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[17:38:47] <icinga-wm>	 RECOVERY - puppet last run on cp1079 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[17:38:51] <icinga-wm>	 RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[17:38:55] <icinga-wm>	 RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[17:39:05] <icinga-wm>	 RECOVERY - puppet last run on mw1331 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[17:39:39] <icinga-wm>	 RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:39:39] <icinga-wm>	 RECOVERY - puppet last run on mw2189 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[17:39:59] <icinga-wm>	 RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[17:40:03] <icinga-wm>	 RECOVERY - puppet last run on ms-fe1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:40:23] <icinga-wm>	 RECOVERY - puppet last run on elastic1052 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:40:23] <icinga-wm>	 RECOVERY - puppet last run on mw2185 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:40:31] <icinga-wm>	 RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:40:51] <icinga-wm>	 RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:40:51] <icinga-wm>	 RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:41:07] <icinga-wm>	 RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[17:41:09] <icinga-wm>	 RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[17:41:13] <icinga-wm>	 RECOVERY - puppet last run on elastic2038 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[17:41:17] <icinga-wm>	 RECOVERY - puppet last run on mw2275 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:41:19] <icinga-wm>	 RECOVERY - puppet last run on mw2155 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:41:19] <icinga-wm>	 RECOVERY - puppet last run on mw2178 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[17:41:21] <icinga-wm>	 RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:41:23] <ottomata>	 elukey: do you know
[17:41:27] <ottomata>	 is this still the proper procedure?
[17:41:27] <ottomata>	 https://wikitech.wikimedia.org/wiki/LVS#Pool_or_depool_hosts_(for_non-Etcd_managed_pools)
[17:42:05] <wikibugs>	 (03CR) 10Esanders: [C: 03+1] Set some wikis to use the mobile-ve-as-default a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519667 (https://phabricator.wikimedia.org/T221196) (owner: 10DLynch)
[17:42:09] <icinga-wm>	 RECOVERY - puppet last run on elastic2026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:42:09] <icinga-wm>	 RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[17:42:13] <icinga-wm>	 RECOVERY - puppet last run on cp1078 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[17:42:19] <icinga-wm>	 RECOVERY - puppet last run on mw1321 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[17:42:19] <icinga-wm>	 RECOVERY - puppet last run on ms-fe1005 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[17:42:33] <icinga-wm>	 RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[17:42:33] <icinga-wm>	 RECOVERY - puppet last run on mw1330 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[17:42:35] <icinga-wm>	 RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:42:49] <icinga-wm>	 RECOVERY - puppet last run on mw2190 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:42:53] <icinga-wm>	 RECOVERY - puppet last run on cp1081 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[17:42:55] <icinga-wm>	 RECOVERY - puppet last run on cp5007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:42:57] <icinga-wm>	 RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[17:42:59] <icinga-wm>	 RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:43:07] <icinga-wm>	 RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[17:43:09] <icinga-wm>	 RECOVERY - puppet last run on mw2273 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[17:43:17] <icinga-wm>	 RECOVERY - puppet last run on cp4026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:43:23] <icinga-wm>	 RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:43:39] <icinga-wm>	 RECOVERY - puppet last run on elastic2034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:43:41] <icinga-wm>	 RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[17:43:47] <icinga-wm>	 RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:43:47] <icinga-wm>	 RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[17:43:47] <icinga-wm>	 RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:43:47] <icinga-wm>	 RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[17:43:51] <icinga-wm>	 RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:43:53] <icinga-wm>	 RECOVERY - puppet last run on elastic1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:43:55] <icinga-wm>	 RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[17:43:55] <icinga-wm>	 RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[17:43:57] <icinga-wm>	 RECOVERY - puppet last run on mw1333 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[17:44:02] <elukey>	 ottomata: no no
[17:44:05] <icinga-wm>	 RECOVERY - puppet last run on mw2278 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:44:15] <icinga-wm>	 RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:44:22] <ottomata>	 yeahh seemed wrong...
[17:44:25] <icinga-wm>	 RECOVERY - puppet last run on mw1313 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:44:27] <icinga-wm>	 RECOVERY - puppet last run on cp5004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[17:44:33] <icinga-wm>	 RECOVERY - puppet last run on mw2215 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:44:39] <elukey>	 ottomata: there is a confctl tool on puppet master
[17:44:39] <icinga-wm>	 RECOVERY - puppet last run on mw2282 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:44:39] <icinga-wm>	 RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:44:41] <icinga-wm>	 RECOVERY - puppet last run on mw2249 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:44:53] <icinga-wm>	 RECOVERY - puppet last run on mw2219 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:44:53] <icinga-wm>	 RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:45:01] <icinga-wm>	 RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:45:01] <icinga-wm>	 RECOVERY - puppet last run on elastic2031 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:45:01] <icinga-wm>	 PROBLEM - puppet last run on cp1087 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 9 seconds ago with 3 failures. Failed resources (up to 3 shown): Service[nginx],Exec[nginx-reload]
[17:45:11] <icinga-wm>	 RECOVERY - puppet last run on mw2269 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:45:11] <icinga-wm>	 RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:45:11] <icinga-wm>	 RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:45:13] <icinga-wm>	 RECOVERY - puppet last run on mw1315 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:45:13] <icinga-wm>	 RECOVERY - puppet last run on mw2254 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:45:17] <icinga-wm>	 RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:45:21] <icinga-wm>	 RECOVERY - puppet last run on elastic2027 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:45:21] <ottomata>	 ok found more docs
[17:45:27] <ottomata>	 haven't depooled an individual service in a while
[17:45:30] <elukey>	 ottomata: should be: sudo -i confctl --quiet depool --hostname scb1001.eqiad.wmnet --service eventstreams
[17:45:46] <ottomata>	 ok thank you
[17:45:47] <ottomata>	 running that
[17:45:47] <icinga-wm>	 RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:45:55] <icinga-wm>	 RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[17:45:55] <icinga-wm>	 RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:45:55] <icinga-wm>	 RECOVERY - puppet last run on elastic2052 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[17:45:55] <icinga-wm>	 RECOVERY - puppet last run on cp4032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:45:57] <icinga-wm>	 RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:46:05] <icinga-wm>	 RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:46:11] <icinga-wm>	 RECOVERY - puppet last run on elastic1044 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:46:11] <icinga-wm>	 RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:46:11] <icinga-wm>	 RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:46:17] <icinga-wm>	 RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:46:37] <icinga-wm>	 RECOVERY - puppet last run on mw1344 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:46:37] <icinga-wm>	 RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[17:46:37] <icinga-wm>	 RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:46:37] <icinga-wm>	 RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:46:43] <icinga-wm>	 RECOVERY - puppet last run on mw2264 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:46:43] <icinga-wm>	 RECOVERY - puppet last run on mw2272 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:46:45] <icinga-wm>	 RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:46:45] <icinga-wm>	 RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:46:49] <icinga-wm>	 RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:46:51] <icinga-wm>	 RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:46:51] <icinga-wm>	 RECOVERY - puppet last run on mw2159 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[17:47:01] <icinga-wm>	 RECOVERY - puppet last run on cp4024 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:47:01] <icinga-wm>	 RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:47:11] <icinga-wm>	 RECOVERY - puppet last run on elastic2025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:47:35] <icinga-wm>	 RECOVERY - puppet last run on cp4029 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[17:47:35] <icinga-wm>	 RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:47:39] <icinga-wm>	 RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:48:07] <icinga-wm>	 RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[17:48:15] <icinga-wm>	 RECOVERY - puppet last run on mw1320 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[17:48:31] <icinga-wm>	 RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[17:48:43] <icinga-wm>	 RECOVERY - puppet last run on elastic1049 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[17:49:12] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] Increase nginx limits on http resp hdr block size [puppet] - 10https://gerrit.wikimedia.org/r/519675 (https://phabricator.wikimedia.org/T226840) (owner: 10CDanis)
[17:50:51] <icinga-wm>	 RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[17:53:19] <cdanis>	 !log increasing nginx proxy_buffer_size / proxy_buffers 02d7bcaa
[17:53:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:55] <icinga-wm>	 RECOVERY - puppet last run on cp1087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:06:01] <icinga-wm>	 RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[18:06:51] <wikibugs>	 (03PS4) 10Bstorm: toolforge: start the configuration yaml for kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531)
[18:06:56] <logmsgbot>	 !log joal@deploy1001 Finished deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy (duration: 53m 35s)
[18:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:26] <logmsgbot>	 !log joal@deploy1001 Started deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1003 only
[18:08:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:30] <logmsgbot>	 !log joal@deploy1001 Finished deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1003 only (duration: 00m 04s)
[18:08:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:04] <logmsgbot>	 !log joal@deploy1001 Started deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1003 only
[18:09:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:09] <logmsgbot>	 !log joal@deploy1001 Finished deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1003 only (duration: 00m 05s)
[18:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:43] <logmsgbot>	 !log joal@deploy1001 Started deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1003 only again
[18:11:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:09] <logmsgbot>	 !log joal@deploy1001 Finished deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1003 only again (duration: 00m 26s)
[18:12:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:18] <elukey>	 !log systemctl reset-failed kafka* units on kafka2001 (in decom phase)
[18:12:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:32] <logmsgbot>	 !log joal@deploy1001 Started deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1004 only
[18:13:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:27] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10User-greg: Reset PW for Code-Health mailing list - https://phabricator.wikimedia.org/T226842 (10greg) 05Open→03Resolved a:03greg I have it, I can share with @Jrbranaa .
[18:14:34] <logmsgbot>	 !log joal@deploy1001 Finished deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1004 only (duration: 01m 03s)
[18:14:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:00] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "Thank you for doing this!  I can't think of any reason why /adding/ things to this repo would do any harm, so the only real bits of intere" [labs/private] - 10https://gerrit.wikimedia.org/r/519051 (owner: 10Jbond)
[18:23:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "(I'm happy to merge and babysit if you'd like me to, just lmk)" [labs/private] - 10https://gerrit.wikimedia.org/r/519051 (owner: 10Jbond)
[18:25:20] <wikibugs>	 (03CR) 10Alex Monk: "+1 to the concept" [labs/private] - 10https://gerrit.wikimedia.org/r/519051 (owner: 10Jbond)
[18:27:22] <wikibugs>	 (03PS4) 10Jbond: missing data: audited using audit_hiera.py script [labs/private] - 10https://gerrit.wikimedia.org/r/519051
[18:33:25] <logmsgbot>	 !log joal@deploy1001 Started deploy [analytics/refinery@de8eb99]: Missing bit of regular analytics deploy
[18:33:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:35] <wikibugs>	 10Operations, 10Analytics, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) Collected some info about which IPs were connecting on scb1001.  Over a period of about 40 minutes:        3 "100.26....
[18:42:03] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:42:07] <icinga-wm>	 PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:44:39] <icinga-wm>	 PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:45:09] <icinga-wm>	 PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:45:15] <icinga-wm>	 PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[18:45:50] <wikibugs>	 (03PS5) 10Andrew Bogott: missing data: audited using audit_hiera.py script [labs/private] - 10https://gerrit.wikimedia.org/r/519051 (owner: 10Jbond)
[18:46:00] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] missing data: audited using audit_hiera.py script [labs/private] - 10https://gerrit.wikimedia.org/r/519051 (owner: 10Jbond)
[18:46:23] <wikibugs>	 (03PS3) 10Andrew Bogott: audit_hiera: This is a small script to audit the private repo [labs/private] - 10https://gerrit.wikimedia.org/r/519050 (https://phabricator.wikimedia.org/T226530) (owner: 10Jbond)
[18:46:34] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] audit_hiera: This is a small script to audit the private repo [labs/private] - 10https://gerrit.wikimedia.org/r/519050 (https://phabricator.wikimedia.org/T226530) (owner: 10Jbond)
[18:47:16] <wikibugs>	 (03PS6) 10Andrew Bogott: missing data: audited using audit_hiera.py script [labs/private] - 10https://gerrit.wikimedia.org/r/519051 (owner: 10Jbond)
[18:47:19] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] missing data: audited using audit_hiera.py script [labs/private] - 10https://gerrit.wikimedia.org/r/519051 (owner: 10Jbond)
[18:48:05] <icinga-wm>	 RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[18:50:31] <icinga-wm>	 RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:51:01] <icinga-wm>	 RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:51:13] <logmsgbot>	 !log joal@deploy1001 Finished deploy [analytics/refinery@de8eb99]: Missing bit of regular analytics deploy (duration: 17m 47s)
[18:51:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:52:17] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:52:21] <icinga-wm>	 RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:03:12] <logmsgbot>	 !log joal@deploy1001 Started deploy [analytics/refinery@de8eb99]: Missing bit of regular analytics deploy
[19:03:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:20] <logmsgbot>	 !log joal@deploy1001 Finished deploy [analytics/refinery@de8eb99]: Missing bit of regular analytics deploy (duration: 02m 08s)
[19:05:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:14] <wikibugs>	 (03CR) 10Jforrester: "Note to deployer: mobile.php is CommonSettings.php-like, so the IS change has to be deployed first or you'll break the world." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519667 (https://phabricator.wikimedia.org/T221196) (owner: 10DLynch)
[19:18:40] <wikibugs>	 (03PS1) 10Mforns: analytics::refinery::job::data_purge Migrate webrequest timers to new deletion script [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862)
[19:19:02] <wikibugs>	 (03PS2) 10Mforns: analytics::refinery::job::data_purge Migrate webrequest timers to new deletion script [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862)
[19:19:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] analytics::refinery::job::data_purge Migrate webrequest timers to new deletion script [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns)
[19:20:58] <wikibugs>	 (03CR) 10Mforns: [C: 04-1] "I tested this thoroughly, plus the checksums ensure the script execution corresponds to the test I did. But, we can wait until I'm back to" [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns)
[19:23:26] <wikibugs>	 (03PS3) 10Mforns: analytics::refinery::job::data_purge Migrate webrequest timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862)
[19:28:02] <wikibugs>	 (03CR) 10Mforns: [C: 04-1] "I tested this thoroughly, plus the checksums ensure the script execution corresponds to the test I did. But, we can wait until I'm back to" [puppet] - 10https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns)
[19:31:25] <wikibugs>	 (03PS1) 10Mforns: analytics::refinery::job::data_purge Migrate EL timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519685 (https://phabricator.wikimedia.org/T226862)
[19:33:12] <wikibugs>	 (03CR) 10Mforns: [C: 04-1] "I tested this thoroughly, plus the checksums ensure the script execution corresponds to the test I did. But, we can wait until I'm back to" [puppet] - 10https://gerrit.wikimedia.org/r/519685 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns)
[19:36:22] <wikibugs>	 (03PS1) 10Mforns: analytics::refinery::job::data_purge Remove timer for WDQS extract [puppet] - 10https://gerrit.wikimedia.org/r/519688 (https://phabricator.wikimedia.org/T226862)
[19:40:05] <James_F>	 Friday deploy ahoy.
[19:40:09] <wikibugs>	 (03PS1) 10Mforns: analytics::refinery::job::data_purge Migrate mediawiki timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519690 (https://phabricator.wikimedia.org/T226862)
[19:41:04] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Set some wikis to use the mobile-ve-as-default a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519667 (https://phabricator.wikimedia.org/T221196) (owner: 10DLynch)
[19:41:06] <wikibugs>	 (03CR) 10Mforns: [C: 04-1] "I tested this thoroughly, plus the checksums ensure the script execution corresponds to the test I did. But, we can wait until I'm back to" [puppet] - 10https://gerrit.wikimedia.org/r/519690 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns)
[19:42:04] <wikibugs>	 (03Merged) 10jenkins-bot: Set some wikis to use the mobile-ve-as-default a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519667 (https://phabricator.wikimedia.org/T221196) (owner: 10DLynch)
[19:42:19] <wikibugs>	 (03CR) 10jenkins-bot: Set some wikis to use the mobile-ve-as-default a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519667 (https://phabricator.wikimedia.org/T221196) (owner: 10DLynch)
[19:45:06] <wikibugs>	 (03PS1) 10Mforns: analytics::refinery::job::data_purge Migrate banner timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519691 (https://phabricator.wikimedia.org/T226862)
[19:46:56] <wikibugs>	 (03CR) 10Mforns: [C: 04-1] "I tested this thoroughly, plus the checksums ensure the script execution corresponds to the test I did. But, we can wait until I'm back to" [puppet] - 10https://gerrit.wikimedia.org/r/519691 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns)
[19:48:15] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T221196 VE mobile A/B test part 1 (duration: 00m 50s)
[19:48:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:22] <stashbot>	 T221196: Set VE as default for target wikis in A/B test - https://phabricator.wikimedia.org/T221196
[19:48:42] <wikibugs>	 (03PS1) 10Mforns: analytics::refinery::job::data_purge Migrate geoeditors timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519693 (https://phabricator.wikimedia.org/T226862)
[19:49:26] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/mobile.php: T221196 VE mobile A/B test part 2 (duration: 00m 49s)
[19:49:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:50] <wikibugs>	 (03CR) 10Mforns: [C: 04-1] "I tested this thoroughly, plus the checksums ensure the script execution corresponds to the test I did. But, we can wait until I'm back to" [puppet] - 10https://gerrit.wikimedia.org/r/519693 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns)
[19:50:40] <wikibugs>	 (03CR) 10Mforns: "WDQS extract does no longer exist." [puppet] - 10https://gerrit.wikimedia.org/r/519688 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns)
[19:59:19] <wikibugs>	 (03PS1) 10Bstorm: toolforge: the kubeadm repo can't be labeled trusted in puppet apparently [puppet] - 10https://gerrit.wikimedia.org/r/519696 (https://phabricator.wikimedia.org/T215531)
[19:59:41] <James_F>	 Lucas_WMDE: Welcome. ;-)
[19:59:55] * James_F is having fun, waiting for CI.
[20:00:02] <Lucas_WMDE>	 thanks ^^
[20:00:09] * Lucas_WMDE peeks at the log
[20:00:21] <Lucas_WMDE>	 looks like the backport isn’t your only friday deployment today
[20:01:00] <James_F>	 Yeah, given I was deploying anyway I helped the Editing team out with their forgotten config patch.
[20:06:32] <James_F>	 Looks good, was able to move on TestCommons.
[20:07:22] <Lucas_WMDE>	 Wikibase editing still seems to work there as well
[20:07:30] <James_F>	 OK, deploying.
[20:08:07] <James_F>	 And no surprise move tabs on Q pages.
[20:08:24] <Lucas_WMDE>	 good point
[20:08:46] <James_F>	 Lucas_WMDE: Do you want me to update on Commons village pump?
[20:08:53] <James_F>	 You should go have your Friday evening. :-)
[20:09:02] <logmsgbot>	 !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/Wikibase/repo/RepoHooks.php: Make it possible for File pages to be moved on Commons again T224303 T226672 (duration: 00m 50s)
[20:09:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:09] <stashbot>	 T224303: Wikibase Repo prevents page moves in NS 0 on Commons - https://phabricator.wikimedia.org/T224303
[20:09:09] <stashbot>	 T226672: File move/ rename tab not working in Commons - https://phabricator.wikimedia.org/T226672
[20:09:13] <Lucas_WMDE>	 please update, yes :)
[20:09:23] <Lucas_WMDE>	 but my Friday evening is fine tyvm :D
[20:09:30] * James_F grins.
[20:10:45] <icinga-wm>	 PROBLEM - HHVM rendering on mw1327 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:12:15] <icinga-wm>	 RECOVERY - HHVM rendering on mw1327 is OK: HTTP OK: HTTP/1.1 200 OK - 81581 bytes in 0.439 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:12:34] <Lucas_WMDE>	 fatalmonitor looks fine
[20:12:40] <James_F>	 Yup.
[20:22:06] <wikibugs>	 (03PS2) 10Jeena Huneidi: Give scaffold template configuration options for dev purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660)
[20:23:13] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolforge: the kubeadm repo can't be labeled trusted in puppet apparently [puppet] - 10https://gerrit.wikimedia.org/r/519696 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm)
[20:25:34] <wikibugs>	 (03PS3) 10Jeena Huneidi: Give scaffold template configuration options for dev purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660)
[20:27:21] <wikibugs>	 (03CR) 10Gergő Tisza: Add rate limiter to Special:ConfirmEmail - config change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: 10SBassett)
[20:49:06] <wikibugs>	 (03PS4) 10Jeena Huneidi: Give scaffold template configuration options for dev purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660)
[20:49:35] <wikibugs>	 (03CR) 10Jeena Huneidi: Give scaffold template configuration options for dev purposes (039 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: 10Jeena Huneidi)
[20:51:29] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] "Just noticed wikimania2018wiki is in /dblists/commonsuploads.dblist. I feel there can be similar problems as there were with closing priva" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) (owner: 10MarcoAurelio)
[20:58:55] <wikibugs>	 (03PS1) 10Jhedden: icinga: fix tools checker stretch jobs [puppet] - 10https://gerrit.wikimedia.org/r/519718 (https://phabricator.wikimedia.org/T213413)
[20:59:03] <wikibugs>	 10Operations, 10Analytics, 10Patch-For-Review, 10Security, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata)
[21:03:05] <Lucas_WMDE>	 looks like everything’s fine after the deploy so I’m logging off now, have a nice weekend :)
[21:16:47] <logmsgbot>	 !log otto@deploy1001 Started deploy [eventstreams/deploy@2af2719]: Manually blacklisting IP - T226808
[21:16:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:53] <stashbot>	 T226808: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808
[21:19:54] <logmsgbot>	 !log otto@deploy1001 Finished deploy [eventstreams/deploy@2af2719]: Manually blacklisting IP - T226808 (duration: 03m 07s)
[21:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:54] <wikibugs>	 10Operations, 10Analytics, 10Patch-For-Review, 10Security, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) To hold us over on the weekend, I've manually blacklisted the offending IP in Eve...
[21:32:21] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] icinga: fix tools checker stretch jobs [puppet] - 10https://gerrit.wikimedia.org/r/519718 (https://phabricator.wikimedia.org/T213413) (owner: 10Jhedden)
[21:33:30] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] icinga: fix tools checker stretch jobs [puppet] - 10https://gerrit.wikimedia.org/r/519718 (https://phabricator.wikimedia.org/T213413) (owner: 10Jhedden)
[21:53:36] <wikibugs>	 (03PS1) 10Bstorm: aptrepo: fix the kubeadm packages to include containerd.io [puppet] - 10https://gerrit.wikimedia.org/r/519726 (https://phabricator.wikimedia.org/T215975)
[22:05:50] <wikibugs>	 (03PS4) 10Jeena Huneidi: Add restbase chart (port from local-charts) [deployment-charts] - 10https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T224935)
[22:06:29] <icinga-wm>	 PROBLEM - puppet last run on mw2252 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[22:07:13] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:07:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:26:32] <wikibugs>	 (03PS2) 10MarcoAurelio: Close wikimania2018.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188)
[22:27:03] <wikibugs>	 (03PS3) 10MarcoAurelio: Close wikimania2018.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188)
[22:27:31] <wikibugs>	 (03CR) 10MarcoAurelio: "> Can you please remove wikimania2018wiki from" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) (owner: 10MarcoAurelio)
[22:31:48] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) (owner: 10MarcoAurelio)
[22:33:43] <icinga-wm>	 RECOVERY - puppet last run on mw2252 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:48:09] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994 (10faidon) 05Stalled→03Declined I think there's a bit of a confusion. AIUI, nftables can refer to two different things: 1. The nf_tables kernel subsystem 1. The nftables...
[22:55:01] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] "I used furl to bypass varnish/apache on P8685 and I see Accept-Language being added in the Vary header." [puppet] - 10https://gerrit.wikimedia.org/r/519606 (https://phabricator.wikimedia.org/T203179) (owner: 10Ema)
[23:03:09] <wikibugs>	 10Operations, 10Analytics, 10Patch-For-Review, 10Security, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Nuria) {F29666420}  Well, blocking that one IP had the effect of lowering connections. Give...