[00:00:04] Deploy window NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190628T0000)
[00:27:50] Happy Friday
[00:36:28] (PS1) Urbanecm: Setup EditorJourney for Arabic Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/519581 (https://phabricator.wikimedia.org/T225737)
[00:36:47] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[01:22:04] !log Killing arclamp-log on webperf1002, no flame graphs for three days, presumably mwlog/redis connection dropped again. T215740
[01:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:10] T215740: Create Icinga check for ArcLamp (xenon-log) service health - https://phabricator.wikimedia.org/T215740
[02:24:40] Operations, serviceops, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (Jdforrester-WMF)
[02:24:42] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Jdforrester-WMF)
[02:26:27] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 134.8 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[02:26:53] Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Jdforrester-WMF) This was a blocker for {T219150}, right? Does {T224857} also block that or is this work s...
[02:30:49] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22484160 and 2 seconds
[02:35:11] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 989840 and 0 seconds
[03:17:53] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:17:59] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[03:18:01] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{
[03:18:01] regated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[03:18:09] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:18:13] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:18:19] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid
[03:18:27] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:18:39] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:18:39] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[03:18:55] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[03:18:55] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:18:55] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[03:19:15] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:19:29] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/
[03:19:37] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:19:37] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:19:43] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid
[03:19:49] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:20:01] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:20:03] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[03:20:17] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[03:20:19] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:20:21] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[03:20:49] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[03:20:49] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[03:20:49] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[03:21:21] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 52.22 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:24:17] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 80.61 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:58:54] Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (jijiki) >>! In T224491#5291244, @Jdforrester-WMF wrote: > This was a blocker for {T219150}, right? Yeah...
[05:35:33] PROBLEM - HHVM rendering on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[05:36:49] RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 82296 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[06:06:54] Operations, Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (ema) p:Triage→Normal
[06:09:45] Operations, Continuous-Integration-Infrastructure (phase-out-jessie): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (MoritzMuehlenhoff)
[06:09:49] Operations, serviceops, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This was resolved via T215975
[06:18:01] (PS1) Muehlenhoff: Add José Pita to LDAP users table [puppet] - https://gerrit.wikimedia.org/r/519594 (https://phabricator.wikimedia.org/T226091)
[06:18:13] Operations, Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (ema) I cannot reproduce the issue right now, but it does look like a strange interaction between the application servers and varnish. >>! In T226776#5290831, @CD...
[06:18:58] (CR) Muehlenhoff: [C: +2] Add José Pita to LDAP users table [puppet] - https://gerrit.wikimedia.org/r/519594 (https://phabricator.wikimedia.org/T226091) (owner: Muehlenhoff)
[06:22:02] (PS1) Muehlenhoff: Add Camille de Nes to LDAP users table [puppet] - https://gerrit.wikimedia.org/r/519595
[06:24:47] Operations, Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (TheDJ) >>! In T226776#5291441, @ema wrote: > Why did you conclude that? the second after I refreshed after: > <+wikibugs> (CR) Andrew Bogott: [C: +2] nova-full...
[06:24:57] Operations, Traffic, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (ema) @Trizek-WMF: personally, I've t...
[06:31:31] PROBLEM - puppet last run on cp5008 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:32:55] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:33:17] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:33:39] PROBLEM - puppet last run on db2086 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:33:49] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:33:57] Operations, Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (ema) >>! In T226776#5291445, @TheDJ wrote: > the link started working again. Two things had changed at that time, the thing above and I had logged out and back i...
[06:52:14] (CR) Muehlenhoff: [C: +2] Add Camille de Nes to LDAP users table [puppet] - https://gerrit.wikimedia.org/r/519595 (owner: Muehlenhoff)
[06:53:25] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:53:47] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:54:19] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:58:41] RECOVERY - puppet last run on cp5008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:29] Operations, serviceops: cronspam for slow queries in PageAssessments - https://phabricator.wikimedia.org/T197564 (Joe) a:Joe→None
[07:00:53] RECOVERY - puppet last run on db2086 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:01:08] (PS1) Ema: cache: reimage cp2005 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519597 (https://phabricator.wikimedia.org/T226637)
[07:04:44] (CR) Vgutierrez: [C: +1] cache: reimage cp2005 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519597 (https://phabricator.wikimedia.org/T226637) (owner: Ema)
[07:05:48] !log depool cp2005 and reimage as upload_ats T226637
[07:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:54] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[07:06:20] (CR) Ema: [C: +2] cache: reimage cp2005 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519597 (https://phabricator.wikimedia.org/T226637) (owner: Ema)
[07:09:13] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2005.codfw.wmnet'] ` The log can be found in `...
[07:11:00] <_joe_> !log upgrading php-wikidiff2 on the mw canaries, only on php7 - T223391
[07:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:06] T223391: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391
[07:17:08] Operations, ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti2009.codfw.wmnet', 'ganeti2010.codfw.wmnet', 'ganeti2011.codfw.wmnet', 'ga...
[07:17:48] Operations, LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (MoritzMuehlenhoff) Hi Leszek, we have two ways to approach this: If you specifically only need Logstash access, we can extend the configuration f...
[07:18:10] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:23:20] Operations, Traffic, CommRel-Specialists-Support (Apr-Jun-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Pruem) @Trizek-WMF: It would help if...
[07:23:31] Operations, WMDE-QWERTY-Team, serviceops, wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (Joe) a:jijiki→Joe I did rollout the new version on the canary servers today. If I don't see higher error rates on mo...
[07:35:46] (CR) Gergő Tisza: Add rate limiter to Special:ConfirmEmail - config change (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: SBassett)
[07:42:34] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:49:00] Operations, Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2005.codfw.wmnet'] ` and were **ALL** successful.
[07:57:52] Operations, Continuous-Integration-Infrastructure (phase-out-jessie): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (hashar)
[07:57:54] !log pool cp2005 w/ ATS backend T226637
[07:57:56] Operations, serviceops, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (hashar) Resolved→Open I need the package for **Stretch**!
[07:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:59] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[08:00:07] Operations, Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (Aklapper) For anyone affected by log-in problems who wants to help track them down: Please see and follow https://www.mediawiki.org/wiki/Manual:How_to_debug/Login_problems and report back here. T...
[08:01:39] Operations, serviceops, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (MoritzMuehlenhoff) And that is what T215975 provides...
[08:02:08] !log updating openssl packages on mw1265
[08:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:33] Operations, ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2014.codfw.wmnet', 'ganeti2009.codfw.wmnet', 'ganeti2013.codfw.wmnet', 'ganeti2012.codfw.wmnet', 'ganeti2011.codfw.wmnet', 'gan...
[08:25:41] (PS1) Elukey: profile::hue: add more specific alarms for kerberos [puppet] - https://gerrit.wikimedia.org/r/519601 (https://phabricator.wikimedia.org/T226698)
[08:28:07] (CR) Elukey: [C: +2] profile::hue: add more specific alarms for kerberos [puppet] - https://gerrit.wikimedia.org/r/519601 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[08:29:19] Operations, Traffic: nginx HTTP 500 rate increase on specific cache hosts - https://phabricator.wikimedia.org/T226805 (ema)
[08:29:32] Operations, Traffic: mobile commons GET dying in Varnish layer(?) under oddly specific conditions - https://phabricator.wikimedia.org/T226776 (ema) >>! In T226776#5291447, @ema wrote: > Well, the error response was surely generated by the applayer and not by varnish (the latter only generates synthetic r...
[08:30:24] Operations, Traffic: nginx HTTP 500 rate increase on specific cache hosts - https://phabricator.wikimedia.org/T226805 (ema) p:Triage→Normal
[08:42:21] Operations, serviceops, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (hashar) Ah eventually I found the entry: ` Name: thirdparty/kubeadm-k8s-docker.com Method: https://download.docker.com/...
[08:43:04] !log roll restart of eventstreams on all scb2* nodes, service now working (kafka transport failures logged)
[08:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:25] (CR) Filippo Giunchedi: [C: +1] Serve JPG when WEBP conversion fails [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/519379 (https://phabricator.wikimedia.org/T226707) (owner: Gilles)
[08:48:19] aaa~.
[08:48:40] nice
[08:48:57] (CR) Filippo Giunchedi: "There seem to be consensus to use the Prometheus based dashboard, should we be replacing this one with it instead of removal? AIUI the nam" [puppet] - https://gerrit.wikimedia.org/r/519410 (https://phabricator.wikimedia.org/T184942) (owner: Cwhite)
[08:57:23] (PS1) Filippo Giunchedi: logstash: add consumer for client errors [puppet] - https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142)
[09:05:19] (PS1) Ema: cache: reimage cp2008 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519604 (https://phabricator.wikimedia.org/T226637)
[09:06:15] Operations, Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (elukey) p:Triage→Normal
[09:07:51] Operations, Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (elukey)
[09:09:44] (CR) Vgutierrez: [C: +1] cache: reimage cp2008 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519604 (https://phabricator.wikimedia.org/T226637) (owner: Ema)
[09:10:29] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:10:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:49] !log rebooting releases* hosts for MDS-enabled qemu/kernel
[09:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:19] (PS2) Filippo Giunchedi: logstash: add consumer for client errors [puppet] - https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142)
[09:15:58] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:16:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:45] !log systemctl reset-failed kafka* units on kafka2002 (role spare, failed units, already masked)
[09:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:35] !log depool cp2008 and reimage as upload_ats T226637
[09:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:40] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[09:18:12] (CR) Ema: [C: +2] cache: reimage cp2008 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519604 (https://phabricator.wikimedia.org/T226637) (owner: Ema)
[09:21:44] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2008.codfw.wmnet'] ` The log can be found in `...
[09:27:24] (PS1) Ema: vcl: remove Vary:AL workaround for fixcopyright.wm.org [puppet] - https://gerrit.wikimedia.org/r/519606 (https://phabricator.wikimedia.org/T203179)
[09:28:51] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:28:53] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:25] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[09:31:08] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[09:35:15] Operations, LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (WMDE-leszek) >>! In T225004#5291475, @MoritzMuehlenhoff wrote: > Hi Leszek, > we have two ways to approach this: If you specifically only need Lo...
[09:41:05] Operations, LDAP-Access-Requests: Grant WMDE engineers access to logstash and creating grafana boards / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (WMDE-leszek)
[09:41:53] (CR) Alexandros Kosiaris: [C: -1] "Overall good in premise, comments inline" (14 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: Jeena Huneidi)
[09:45:26] Operations, vm-requests: Site: eqiad/codfw 2 VMs each for pool counters - https://phabricator.wikimedia.org/T226811 (MoritzMuehlenhoff)
[09:51:50] (PS1) Elukey: Add missing kerberos config to Hadoop HDFS (test cluster) [puppet] - https://gerrit.wikimedia.org/r/519607 (https://phabricator.wikimedia.org/T226698)
[09:54:43] (CR) Elukey: [C: +2] Add missing kerberos config to Hadoop HDFS (test cluster) [puppet] - https://gerrit.wikimedia.org/r/519607 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[09:56:37] (CR) Gehel: [C: -1] "See comments inline." (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/506954 (https://phabricator.wikimedia.org/T221212) (owner: Volans)
[09:56:56] Operations, ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['ganeti2009.codfw.wmnet', 'ganeti2010.codfw.wmnet', 'ganeti2011.codfw.wmnet', 'ga...
[09:59:10] PROBLEM - Host ganeti2010 is DOWN: PING CRITICAL - Packet loss = 100%
[09:59:23] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2008.codfw.wmnet'] ` and were **ALL** successful.
[10:03:05] RECOVERY - Host ganeti2010 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[10:06:15] PROBLEM - dhclient process on ganeti2010 is CRITICAL: connect to address 10.192.32.139 port 5666: Connection refused
[10:06:20] PROBLEM - puppet last run on ganeti2010 is CRITICAL: connect to address 10.192.32.139 port 5666: Connection refused
[10:06:23] PROBLEM - Check systemd state on ganeti2010 is CRITICAL: connect to address 10.192.32.139 port 5666: Connection refused
[10:06:44] akosiaris: ^
[10:06:45] PROBLEM - configured eth on ganeti2010 is CRITICAL: connect to address 10.192.32.139 port 5666: Connection refused
[10:06:55] PROBLEM - MD RAID on ganeti2010 is CRITICAL: connect to address 10.192.32.139 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:07:02] is that related to your reimage?
[10:07:24] ema: yup. It's the only host that was actually successfully imaged in the previous run
[10:08:39] Operations, netops, User-fgiunchedi: Add centrallog1001 to syslog servers in network ACLs - https://phabricator.wikimedia.org/T226813 (fgiunchedi)
[10:08:48] akosiaris: ok, anything to do or can I ack the alerts?
[10:09:09] I 'll ack the alerts
[10:09:13] thanks!
[10:09:21] hosts are not in service btw, completely new hosts
[10:09:28] !log pool cp2008 w/ ATS backend T226637
[10:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:33] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637
[10:10:36] (PS1) Elukey: profile::hadoop::backup::namenode: move crons to timers [puppet] - https://gerrit.wikimedia.org/r/519610 (https://phabricator.wikimedia.org/T226698)
[10:10:41] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:11:19] (CR) Gehel: [C: -1] cookbook API: add class API (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: Volans)
[10:13:15] (Abandoned) DCausse: Add a new extension point SshExecuteCommandInterceptor [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/502764 (owner: DCausse)
[10:14:18] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/17163/" [puppet] - https://gerrit.wikimedia.org/r/519610 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[10:15:20] RECOVERY - MD RAID on ganeti2010 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:15:34] RECOVERY - dhclient process on ganeti2010 is OK: PROCS OK: 0 processes with command name dhclient
[10:15:46] RECOVERY - Check systemd state on ganeti2010 is OK: OK - running: The system is fully operational
[10:17:57] PROBLEM - Host ganeti2009 is DOWN: PING CRITICAL - Packet loss = 100%
[10:19:33] RECOVERY - configured eth on ganeti2010 is OK: OK - interfaces up
[10:19:53] RECOVERY - Host ganeti2009 is UP: PING OK - Packet loss = 0%, RTA = 36.30 ms
[10:20:06] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/TimedMediaHandler/maintenance/requeueTranscodes.php: Extra filtering option (duration: 00m 51s)
[10:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:01] RECOVERY - puppet last run on ganeti2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:25:53] (PS3) Volans: cookbook API: add class API [software/spicerack] - https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212)
[10:26:58] (PS1) Elukey: profile::hadoop::balancer: remove unused kerberos wrapper [puppet] - https://gerrit.wikimedia.org/r/519612 (https://phabricator.wikimedia.org/T226698)
[10:27:45] (CR) Elukey: [C: +2] profile::hadoop::balancer: remove unused kerberos wrapper [puppet] - https://gerrit.wikimedia.org/r/519612 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[10:29:09] Operations, observability, serviceops: Gather metrics on request status codes, latencies from the MediaWiki appservers - https://phabricator.wikimedia.org/T226815 (Joe)
[10:31:36] !log running `foreachwiki extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --audio --mime=audio/midi --missing --throttle` on mwmaint1002 in screen T226713
[10:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:44] T226713: Run cleanupTranscodes.php for current midi files - https://phabricator.wikimedia.org/T226713
[10:36:08] Operations, observability, serviceops: Gather metrics on request status codes, latencies from the MediaWiki appservers - https://phabricator.wikimedia.org/T226815 (Joe) One relatively easy way to go could be to use mtail, which we use for quite some other things too. There is even an [[https://gith...
[10:36:29] Operations, observability, serviceops: Gather metrics on request status codes, latencies from the MediaWiki appservers - https://phabricator.wikimedia.org/T226815 (Joe)
[10:36:51] (PS4) Volans: cookbook API: add class API [software/spicerack] - https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212)
[10:37:00] Operations, Gerrit-Privilege-Requests, SRE-Access-Requests: Gerrit manager rights for Ottomata - https://phabricator.wikimedia.org/T226724 (jbond) p:Triage→Normal
[10:37:31] PROBLEM - puppet last run on ganeti2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[drbd8-utils]
[10:37:46] (CR) Volans: "I finally had a bit of time to resume this work. See inline comments for the various bits." (4 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: Volans)
[10:39:15] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:39:38] Operations, Gerrit-Privilege-Requests, SRE-Access-Requests: Gerrit manager rights for Ottomata - https://phabricator.wikimedia.org/T226724 (jbond) @Legoktm @hashar are either of you able to help with this?
[10:39:40] (PS1) Ema: cache: reimage cp2011 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519613 (https://phabricator.wikimedia.org/T226637)
[10:40:34] Operations, observability, serviceops: Gather metrics on request status codes, latencies from the MediaWiki appservers - https://phabricator.wikimedia.org/T226815 (jbond) p:Triage→Normal
[10:41:14] Operations, netops, User-fgiunchedi: Add centrallog1001 to syslog servers in network ACLs - https://phabricator.wikimedia.org/T226813 (jbond) p:Triage→Normal
[10:41:22] (CR) Vgutierrez: [C: +1] cache: reimage cp2011 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/519613 (https://phabricator.wikimedia.org/T226637) (owner: Ema)
[10:42:17] Operations, vm-requests: Site: eqiad/codfw 2 VMs each for pool counters - https://phabricator.wikimedia.org/T226811 (jbond) p:Triage→Normal
[10:44:02] Operations, Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (jbond) p:Triage→High
[10:45:41] Operations, Wikimedia-Mailing-lists, Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (jbond) p:Triage→Normal
[10:48:00] (CR) Volans: "Some replies to the comments inline" (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/506954 (https://phabricator.wikimedia.org/T221212) (owner: Volans)
[10:48:29] Operations, Continuous-Integration-Infrastructure, serviceops, Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (hashar) Update: it needs more discussion among SRE team :-]
[11:00:46] Operations, Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (Reedy) Does it persist after clearing cookies for the wiki domain? No existing cookies in incognito When logging in in incognito mode? Doesn't work either When logging in with a different kind of...
[11:01:38] Operations, Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (Reedy) Probably needs a HAR file at this point
[11:03:08] (PS1) Ayounsi: DNS for netflow1001 [dns] - https://gerrit.wikimedia.org/r/519616 (https://phabricator.wikimedia.org/T226810)
[11:04:24] <_joe_> !log uploading php-wmerrors to thirdparty/php72 - T187147
[11:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:30] T187147: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147
[11:07:32] Operations, Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (Samwalton9) https://stream.wikimedia.org/ seems to be down again?
[11:09:57] !log draining kubernetes2005 for applying updates [11:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:30] !log draining kubernetes2006 for applying updates [11:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:28] !log draining kubernetes1005 for applying updates [11:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:58] !log draining kubernetes1006 for applying updates [11:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:19] (03PS2) 10Ayounsi: DNS for netflow1001 [dns] - 10https://gerrit.wikimedia.org/r/519616 (https://phabricator.wikimedia.org/T226810) [11:21:21] (03CR) 10Ayounsi: [C: 03+2] DNS for netflow1001 [dns] - 10https://gerrit.wikimedia.org/r/519616 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [11:33:46] !log restart eventstreams on scb1001 [11:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:34] 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10elukey) @Samwalton9 We are currently working on it, there is indeed an issue with the eqiad backend hosts of eventstreams :( [11:36:45] !log roll restart eventstreams on all scb1* nodes [11:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:41] 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Urbanecm) I can reproduce this when using mr.wikipedia.org in mr, however, when I try to login using https://mr.wikipedia.org/w/index.php?title=%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B7:%E0%A4... [11:42:49] 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Reedy) >>! 
In T226754#5292040, @Urbanecm wrote: > I can reproduce this when using mr.wikipedia.org in mr, however, when I try to login with [interface switched into English](https://mr.wikipedia.o... [11:45:13] 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10elukey) Service should be restored now @Samwalton9 [11:46:46] 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Urbanecm) Okay, I think this is caused by local content of mediawiki:Loginprompt. When I copied loginprompt from mrwiki to test.wikipedia.org, I reproduced the issue there as well. When I reverted... [11:49:05] 10Operations, 10Wikimedia-General-or-Unknown: Login error on mrwiki - https://phabricator.wikimedia.org/T226754 (10Reedy) >>! In T226754#5292057, @Urbanecm wrote: > Okay, I think this is caused by local content of mediawiki:Loginprompt. When I copied loginprompt from mrwiki to test.wikipedia.org, I reproduced... [11:49:52] 10Operations, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10elukey) We got the following alarm at 10:42 UTC (why only analytics? Shouldn't this be owned by SRE?) ` 10:42 PROBLEM - Check if active Eve... 
[11:49:59] 10Operations, 10Analytics, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10elukey) [11:59:18] 10Operations, 10Wikimedia-General-or-Unknown: Impossible to log into mrwiki due to broken local "MediaWiki:Loginprompt" page - https://phabricator.wikimedia.org/T226754 (10Aklapper) [12:02:39] 10Operations, 10Traffic, 10CommRel-Specialists-Support (Apr-Jun-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Vort) Problem was reproduced by me j... [12:03:51] (03PS1) 10Ayounsi: Netboot/DHCP for netflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/519620 (https://phabricator.wikimedia.org/T226810) [12:05:37] (03CR) 10Ayounsi: [C: 03+2] Netboot/DHCP for netflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/519620 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [12:16:03] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2012.codfw.wmnet', 'ganeti2011.codfw.wmnet', 'ganeti2009.codfw.wmnet'] ` Of which those **FAILED**: ` ['ganeti2012.codfw.wmnet... [12:17:43] so i JUST had that very slow character-by-character page download again from esams [12:18:42] 10Operations, 10Discovery-Search, 10hardware-requests: Replace elastic1017-1031 - https://phabricator.wikimedia.org/T221636 (10Gehel) a:05Gehel→03RobH This is scheduled to be done in Q1, so we can get started. As a reminder, some preliminary estimates were done in T222104 (not sure what can / should be re... [12:19:49] ema: i have the page open, but the inspector can't check the headers of the request, as i opened the inspector after opening the page.. anything else i can check to find a cause?
[12:19:49] 10Operations, 10Discovery-Search, 10hardware-requests: Replace elastic1017-1031 - https://phabricator.wikimedia.org/T221636 (10Gehel) [12:20:19] 10Operations, 10hardware-requests, 10Discovery-Search (Current work): Replace elastic1017-1031 - https://phabricator.wikimedia.org/T221636 (10Gehel) [12:20:35] 10Operations, 10Traffic, 10CommRel-Specialists-Support (Apr-Jun-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10TheDJ) Just happened to me again as... [12:28:07] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 102.1 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [12:35:13] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) Classification successfully deployed in ulsfo/codfw/eqdfw/eqord (half-ish of our POPs), will push to the other sites early next week. Then start dropping invalids on IXPs to see the effect it has in terms of traffic... [12:36:39] tgr: I'm trying to trigger errors on demand to test T217142, pasting the code you have at https://en.wikipedia.beta.wmflabs.org/wiki/User:Tgr/common.js in my browser's console should be enough? [12:36:40] T217142: [WIP] [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors - https://phabricator.wikimedia.org/T217142 [12:43:48] 10Operations, 10observability: consider running bastion Prometheis inside cgroups - https://phabricator.wikimedia.org/T226769 (10fgiunchedi) Agreed, memory limiting in the interim while Ganeti is being set up sounds good to me. [12:45:48] 10Operations, 10Analytics, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Samwalton9) Looks good to me, thanks!
[12:51:00] godog: yeah, it adds an error generator link on the left sidebar [12:51:43] tgr: awesome, thank you! [12:53:39] tgr: I'm clicking the generated and I see the uncaught error in the console, but I'm not seeing the network request to eventgate-logging in the network tab, it should be there I assume? [12:53:51] the generated link even [12:54:48] yeah, it should [12:55:31] works for me [12:55:35] mhh no still not seeing it, do I need to be logged in perhaps? [12:56:02] if you are not logged in how did you add the script in the first place? [12:56:40] oh, you were using the JS console? [12:56:49] yeah [12:57:05] that won't work, the script triggering the error has to be same-origin [12:57:11] 10Operations, 10Traffic, 10CommRel-Specialists-Support (Apr-Jun-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Vort) Here is a screenshot of laggy... [12:57:39] otherwise the browser will sanitize away all useful details and Raven will ignore it [13:00:23] ack, got it thank you [13:00:41] now trying to create an account on beta I'm getting The database has been automatically locked while the replica database servers catch up to the master. 
-.- [13:01:09] from Special:CreateAccount that is [13:02:55] I'll ask on -releng [13:03:49] that seems genuine [13:08:22] aye, works now including generating the errors [13:10:26] (03PS1) 10Ayounsi: Apply role netinsights to netflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/519626 (https://phabricator.wikimedia.org/T226810) [13:13:28] (03CR) 10Ayounsi: [C: 03+2] Apply role netinsights to netflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/519626 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [13:15:19] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10awight) @jijiki We're a bit confused because the beta cluster [[ https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version |... [13:15:25] tgr|away: ok now errors show up in logstash too, https://logstash-beta.wmflabs.org/goto/3aa34cf042b6149fe7126cba08df1143 [13:15:34] I'll update the puppet patch [13:20:59] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10Joe) >>! In T223391#5292192, @awight wrote: > @jijiki We're a bit confused because the beta cluster [[ https://en.wikipedia.b... [13:22:18] (03PS3) 10Filippo Giunchedi: logstash: add consumer for client errors [puppet] - 10https://gerrit.wikimedia.org/r/519603 (https://phabricator.wikimedia.org/T217142) [13:22:32] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10awight) >>! In T223391#5292196, @Joe wrote: >>>! In T223391#5292192, @awight wrote: >> @jijiki We're a bit confused because t... 
[13:24:55] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10Joe) Yes :) It's by far the best option. [13:25:04] <_joe_> awight: yeah, sorry, brainfart [13:25:22] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 3 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10awight) Certainly seems to be the case. Good news, with PHP7 enabled I get the expected results where the new version of wik... [13:25:43] _joe_: I didn't smell a thing. [13:31:27] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team Backlog (Watching / External), and 2 others: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (10Ottomata) Hm @herron, today we experienced {T226808}, which I think is... [13:35:20] awesome! [13:40:36] (03PS5) 10Giuseppe Lavagetto: Add a fatal error page to go with the proposed wmerrors feature [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [13:46:18] thedj: hey! Can you load the page with inspector enabled and tell me the value of the X-Cache response header? [13:49:04] ema: right now that page returns: cp1089 pass, cp3043 pass, cp3030 pass [13:49:37] and a refresh later it's cp1083 pass, cp3033 pass, cp3030 pass [13:49:49] but these all work instantly [13:50:55] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Qgil) Hi @revi, good point. I *think* Discourse doesn't do this out of the box, but we can investigate (and probably discuss in a task apart). Maybe a plu... [13:51:47] thedj: ok perfect. The interesting value for me right now is cp3030, the frontend. 
See https://wikitech.wikimedia.org/wiki/Varnish#X-Cache for an explanation if you're curious :) [13:53:01] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) The queries for `varnishstatsd` metrics I've been able to find during the audit: ` (varnish.$dc.backends.be_*api_svc*.GET.sample_rate, 60... [13:54:39] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10revi) I don't know the exact reason why it is not archived, but I think most plausible reason is, in mailman, it is virtually tooooooo hard to erase stuff... [14:01:02] PROBLEM - puppet last run on netflow1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[kafkatee] [14:04:08] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [14:04:17] (03CR) 10CDanis: "this looks pretty good. 
given the stack of changes here, I'm not asking for any changes now, just making notes for a followup pass once a" (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 (owner: 10Volans) [14:06:22] !log depool cp2011 and reimage as upload_ats T226637 [14:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:28] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [14:07:06] (03CR) 10Ema: [C: 03+2] cache: reimage cp2011 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519613 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [14:07:15] (03PS2) 10Ema: cache: reimage cp2011 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/519613 (https://phabricator.wikimedia.org/T226637) [14:07:32] !log eevans@deploy1001 scap-helm sessionstore upgrade production -f sessionstore-staging-values.yaml stable/kask [namespace: sessionstore, clusters: staging] [14:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:42] 10Operations, 10Puppet, 10Packaging: facter3: Unable to parse routing table - https://phabricator.wikimedia.org/T222356 (10jbond) The >>! In T222356#5167750, @jbond wrote: > correct pull request https://github.com/puppetlabs/facter/pull/1775 This has now been merged and I have built a new package [avalibl... [14:10:38] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2011.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim... 
[14:11:29] !log eevans@deploy1001 scap-helm sessionstore upgrade staging -f sessionstore-staging-values.yaml stable/kask [namespace: sessionstore, clusters: staging] [14:11:30] !log eevans@deploy1001 scap-helm sessionstore cluster staging completed [14:11:30] !log eevans@deploy1001 scap-helm sessionstore finished [14:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:19] (03CR) 10CDanis: "one question, one thing to fix afterwards" (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/519458 (owner: 10Volans) [14:18:27] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10Papaul) [14:19:57] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:29:03] (03CR) 10CDanis: [C: 03+2] configuration: change IRC default values [software/conftool] - 10https://gerrit.wikimedia.org/r/519459 (owner: 10Volans) [14:29:09] (03CR) 10CDanis: [C: 03+2] dbconfig: allow to remote paste messages [software/conftool] - 10https://gerrit.wikimedia.org/r/519460 (owner: 10Volans) [14:31:48] (03PS6) 10Giuseppe Lavagetto: mediawiki::php: add a fatal error page to go with the proposed wmerrors feature [puppet] - 10https://gerrit.wikimedia.org/r/516988 (https://phabricator.wikimedia.org/T187147) (owner: 10Tim Starling) [14:33:07] (03CR) 10CDanis: [C: 03+2] dbconfig: improve config commit and restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519461 (owner: 10Volans) [14:33:33] (03PS2) 10Volans: dbconfig: structure return values of actions [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 [14:33:35] (03PS2) 10Volans: dbconfig: config diff, on non-empty diff exit 1 [software/conftool] - 
10https://gerrit.wikimedia.org/r/519458 [14:33:37] (03PS2) 10Volans: configuration: change IRC default values [software/conftool] - 10https://gerrit.wikimedia.org/r/519459 [14:33:39] (03PS2) 10Volans: dbconfig: allow to remotely paste messages [software/conftool] - 10https://gerrit.wikimedia.org/r/519460 [14:33:41] (03PS2) 10Volans: dbconfig: improve config commit and restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519461 [14:34:11] (03CR) 10Volans: "comments addressed" (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 (owner: 10Volans) [14:34:36] (03CR) 10Volans: "comment addressed, question answered" (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/519458 (owner: 10Volans) [14:35:17] (03CR) 10CDanis: dbconfig: structure return values of actions (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 (owner: 10Volans) [14:37:01] (03CR) 10CDanis: [C: 03+2] dbconfig: config diff, on non-empty diff exit 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/519458 (owner: 10Volans) [14:38:57] (03CR) 10Volans: "missed done..." 
(031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 (owner: 10Volans) [14:41:29] (03PS3) 10Volans: dbconfig: structure return values of actions [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 [14:41:31] (03PS3) 10Volans: dbconfig: config diff, on non-empty diff exit 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/519458 [14:41:33] (03PS3) 10Volans: configuration: change IRC default values [software/conftool] - 10https://gerrit.wikimedia.org/r/519459 [14:41:35] (03PS3) 10Volans: dbconfig: allow to remotely paste messages [software/conftool] - 10https://gerrit.wikimedia.org/r/519460 [14:41:37] (03PS3) 10Volans: dbconfig: improve config commit and restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519461 [14:42:09] (03PS1) 10Ladsgroup: labs: Enable jsonld for entity type for beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519646 (https://phabricator.wikimedia.org/T226472) [14:42:20] (03CR) 10CDanis: [C: 03+2] dbconfig: structure return values of actions [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 (owner: 10Volans) [14:43:25] (03CR) 10Ladsgroup: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519646 (https://phabricator.wikimedia.org/T226472) (owner: 10Ladsgroup) [14:44:21] (03Merged) 10jenkins-bot: labs: Enable jsonld for entity type for beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519646 (https://phabricator.wikimedia.org/T226472) (owner: 10Ladsgroup) [14:44:49] ^ rebased on deploy1001 [14:45:17] (03Merged) 10jenkins-bot: dbconfig: structure return values of actions [software/conftool] - 10https://gerrit.wikimedia.org/r/519457 (owner: 10Volans) [14:45:19] (03Merged) 10jenkins-bot: dbconfig: config diff, on non-empty diff exit 1 [software/conftool] - 10https://gerrit.wikimedia.org/r/519458 (owner: 10Volans) [14:45:21] (03Merged) 10jenkins-bot: configuration: change IRC default values [software/conftool] - 
10https://gerrit.wikimedia.org/r/519459 (owner: 10Volans) [14:45:37] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Qgil) Well, yes, I was about to ask why. :) In Discourse entire topics (threads) or specific posts (messages) can be deleted by admins and moderators fr... [14:45:52] (03CR) 10CDanis: [C: 03+2] dbconfig: allow to remotely paste messages [software/conftool] - 10https://gerrit.wikimedia.org/r/519460 (owner: 10Volans) [14:46:22] (03CR) 10jenkins-bot: labs: Enable jsonld for entity type for beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519646 (https://phabricator.wikimedia.org/T226472) (owner: 10Ladsgroup) [14:47:25] !log upload kafkatee to buster-wikimedia [14:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:27] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2011.codfw.wmnet'] ` and were **ALL** successful. [14:48:29] (03Merged) 10jenkins-bot: dbconfig: allow to remotely paste messages [software/conftool] - 10https://gerrit.wikimedia.org/r/519460 (owner: 10Volans) [14:48:31] (03Merged) 10jenkins-bot: dbconfig: improve config commit and restore [software/conftool] - 10https://gerrit.wikimedia.org/r/519461 (owner: 10Volans) [14:48:40] !log pool cp2011 w/ ATS backend T226637 [14:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:45] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [14:49:02] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) >>! In T204056#5285817, @jcrespo wrote: > This is blocked on @CRoslof or someone else from legal. 
Is there any more information we can provide on the i... [14:50:07] RECOVERY - puppet last run on netflow1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [14:51:11] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:54:16] (03CR) 10CDanis: "Hashar, are you still looking for reviews on these Swift changes?" [puppet] - 10https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [14:59:13] PROBLEM - Host ganeti2014 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:39] PROBLEM - Host ganeti2009 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:51] PROBLEM - Host ganeti2010 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:59] PROBLEM - Host ganeti2013 is DOWN: PING CRITICAL - Packet loss = 100% [15:00:05] PROBLEM - Host ganeti2012 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:55] ignore those ^ [15:02:00] icinga was faster than me [15:02:13] RECOVERY - Host ganeti2010 is UP: PING WARNING - Packet loss = 64%, RTA = 36.22 ms [15:02:15] !log eevans@deploy1001 scap-helm sessionstore upgrade production -f sessionstore-eqiad-values.yaml stable/kask [namespace: sessionstore, clusters: eqiad] [15:02:15] RECOVERY - Host ganeti2014 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [15:02:15] !log eevans@deploy1001 scap-helm sessionstore cluster eqiad completed [15:02:16] !log eevans@deploy1001 scap-helm sessionstore finished [15:02:19] RECOVERY - Host ganeti2009 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [15:02:19] RECOVERY - Host ganeti2013 is UP: PING OK - Packet loss = 0%, RTA = 36.31 ms [15:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:31] RECOVERY - Host ganeti2012 is UP: PING OK - 
Packet loss = 0%, RTA = 36.17 ms [15:02:31] PROBLEM - Host ganeti2011 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:31] PROBLEM - Host ganeti2015 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:31] PROBLEM - Host ganeti2016 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:31] PROBLEM - Host ganeti2017 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:59] RECOVERY - Host ganeti2011 is UP: PING WARNING - Packet loss = 64%, RTA = 36.24 ms [15:04:41] RECOVERY - Host ganeti2015 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [15:04:41] RECOVERY - Host ganeti2016 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [15:04:41] RECOVERY - Host ganeti2017 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [15:04:43] !log eevans@deploy1001 scap-helm sessionstore upgrade production -f sessionstore-codfw-values.yaml stable/kask [namespace: sessionstore, clusters: codfw] [15:04:44] !log eevans@deploy1001 scap-helm sessionstore cluster codfw completed [15:04:44] !log eevans@deploy1001 scap-helm sessionstore finished [15:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:51] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Nuria) The expiration for objects can be specified at the time of upload, so it needs to be added to our current wo... [15:20:57] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10fgiunchedi) @EBernhardson @Ottomata re: swift expiring objects, see the link above too and tl;dr is: The X-Delete-... 
[15:22:24] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Ottomata) Great! @fgiunchedi you said 'that is something we'd have to deploy first'. Can I use this now? [15:29:52] (03PS1) 10Andrew Bogott: fullstack monitoring: adjust behavior of the leak counter [puppet] - 10https://gerrit.wikimedia.org/r/519653 (https://phabricator.wikimedia.org/T226647) [15:31:02] (03PS2) 10Andrew Bogott: fullstack monitoring: adjust behavior of the leak counter [puppet] - 10https://gerrit.wikimedia.org/r/519653 (https://phabricator.wikimedia.org/T226647) [15:31:45] (03CR) 10Andrew Bogott: [C: 03+2] fullstack monitoring: adjust behavior of the leak counter [puppet] - 10https://gerrit.wikimedia.org/r/519653 (https://phabricator.wikimedia.org/T226647) (owner: 10Andrew Bogott) [15:37:30] 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Eevans) >>! In T222960#5289614, @Cmjohnson wrote: > @Eevans Do you still want to move this se... [15:41:27] (03PS1) 10Jhedden: pdns: add rec_control profile to codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/519656 (https://phabricator.wikimedia.org/T224688) [15:42:14] (03CR) 10jerkins-bot: [V: 04-1] pdns: add rec_control profile to codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/519656 (https://phabricator.wikimedia.org/T224688) (owner: 10Jhedden) [15:43:36] (03PS2) 10Jhedden: pdns: add rec_control profile to codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/519656 (https://phabricator.wikimedia.org/T224688) [15:48:43] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10fgiunchedi) >>! 
In T213976#5292498, @Ottomata wrote: > Great! @fgiunchedi you said 'that is something we'd have to... [15:49:29] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Ottomata) Oh ok, will do! [15:51:19] (03CR) 10Andrew Bogott: [C: 03+1] pdns: add rec_control profile to codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/519656 (https://phabricator.wikimedia.org/T224688) (owner: 10Jhedden) [15:57:39] (03CR) 10Jhedden: [C: 03+2] pdns: add rec_control profile to codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/519656 (https://phabricator.wikimedia.org/T224688) (owner: 10Jhedden) [16:01:43] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10mmodell) >>! In T213976#4968603, @Ottomata wrote: > @mmodell This is kind of a 'deployment' process thing, is this... [16:01:51] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10mmodell) >>! In T213976#4995886, @Ladsgroup wrote: > Yes, you're right. Maybe turning mwmaint1002 to a minikube and... 
[16:01:55] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10Wikimedia-production-error: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Krinkle) [16:02:22] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10Performance-Team (Radar), and 2 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Krinkle) [16:04:10] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10Performance-Team (Radar), and 2 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10CDanis) NB that the default limit in Varnish actually... [16:11:17] (03PS1) 10BBlack: varnish: temporarily allow more response headers [puppet] - 10https://gerrit.wikimedia.org/r/519661 (https://phabricator.wikimedia.org/T226840) [16:11:54] (03PS1) 10Cwhite: grafana: fix and update to grafana-dashboard script [puppet] - 10https://gerrit.wikimedia.org/r/519662 [16:12:19] (03CR) 10jerkins-bot: [V: 04-1] grafana: fix and update to grafana-dashboard script [puppet] - 10https://gerrit.wikimedia.org/r/519662 (owner: 10Cwhite) [16:15:38] (03PS1) 10Cwhite: grafana: update varnish-aggregate-client-status-codes to prometheus version [puppet] - 10https://gerrit.wikimedia.org/r/519664 (https://phabricator.wikimedia.org/T184942) [16:16:11] (03Abandoned) 10Cwhite: grafana: remove legacy varnish-aggregate-client-status-codes [puppet] - 10https://gerrit.wikimedia.org/r/519410 (https://phabricator.wikimedia.org/T184942) (owner: 10Cwhite) [16:19:13] (03PS2) 10Cwhite: grafana: fix and update to grafana-dashboard script [puppet] - 10https://gerrit.wikimedia.org/r/519662 [16:32:31] (03CR) 10BBlack: [C: 03+2] varnish: temporarily allow more response headers [puppet] - 
10https://gerrit.wikimedia.org/r/519661 (https://phabricator.wikimedia.org/T226840) (owner: 10BBlack) [16:35:10] !log Raising varnish max_http_hdr (max allowed applayer response header count) from 64->128 in systemd config and live tuning - https://gerrit.wikimedia.org/r/519661 - T226840 [16:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:16] T226840: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 [16:36:17] (03PS2) 10Isaac Johnson: Undeploy reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519216 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [16:38:10] (03CR) 10Nuria: ReportUpdater: change repo of all queries to reportupdater-queries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/517085 (https://phabricator.wikimedia.org/T221064) (owner: 10Fdans) [16:39:52] 10Operations, 10Analytics, 10Services: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Ottomata) EventStreams is hitting its concurrent connection limits of about 200 connections. We think this is probably due to a single cl... [16:41:37] (03PS2) 10Fsero: introducing helmfile.d values for staging cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/517887 (https://phabricator.wikimedia.org/T212130) [16:48:21] (03PS1) 10DLynch: Set some wikis to use the mobile-ve-as-default a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519667 (https://phabricator.wikimedia.org/T221196) [16:51:27] 10Operations, 10Wikimedia-Mailing-lists: Reset PW for Code-Health mailing list - https://phabricator.wikimedia.org/T226842 (10Jrbranaa) [17:02:03] PROBLEM - puppet last run on ganeti2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Package[drbd8-utils] [17:02:19] ah [17:05:03] (03PS1) 10BBlack: Increase nginx limits on http resp hdr block size [puppet] - 10https://gerrit.wikimedia.org/r/519670 (https://phabricator.wikimedia.org/T226840) [17:05:43] (03CR) 10SBassett: Add rate limiter to Special:ConfirmEmail - config change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: 10SBassett) [17:06:30] (03CR) 10BBlack: [C: 03+2] Increase nginx limits on http resp hdr block size [puppet] - 10https://gerrit.wikimedia.org/r/519670 (https://phabricator.wikimedia.org/T226840) (owner: 10BBlack) [17:11:31] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:11:35] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:11:47] PROBLEM - puppet last run on mw1331 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:11:52] 10Operations, 10hardware-requests, 10Discovery-Search (Current work): Replace elastic1017-1031 - https://phabricator.wikimedia.org/T221636 (10RobH) [17:11:55] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) [17:12:15] PROBLEM - puppet last run on cp1087 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:12:23] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[nginx-reload] [17:12:23] PROBLEM - puppet last run on mw2189 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:12:47] PROBLEM - puppet last run on ms-fe1007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:13:05] PROBLEM - puppet last run on elastic1052 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:13:07] PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:13:13] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:13:21] !log joal@deploy1001 Started deploy [analytics/refinery@8d6fa30]: Last regular analytics weekly deploy [17:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:33] PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:13:33] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:13:37] (03PS1) 10BBlack: Revert "Increase nginx limits on http resp hdr block size" [puppet] - 10https://gerrit.wikimedia.org/r/519672 [17:13:49] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "Increase nginx limits on http resp hdr block size" [puppet] - 10https://gerrit.wikimedia.org/r/519672 (owner: 10BBlack) [17:13:51] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:13:51] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:13:51] (03CR) 10Jeena Huneidi: Give scaffold template configuration options for dev purposes (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: 10Jeena Huneidi) [17:13:57] PROBLEM - puppet last run on mw2275 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:13:59] PROBLEM - puppet last run on mw2155 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:13:59] PROBLEM - puppet last run on mw2178 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:14:03] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:14:33] known and not breaking prod ^ [17:14:38] nginx-reload issues are from the patch I'm reverting. either way it's noise and non-affecting of prod traffic [17:14:47] PROBLEM - puppet last run on elastic2026 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:14:49] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:14:53] PROBLEM - puppet last run on cp1078 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[nginx-reload] [17:15:01] PROBLEM - puppet last run on ms-fe1005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:15:01] PROBLEM - puppet last run on mw1321 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:15:17] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:15:27] PROBLEM - puppet last run on mw1320 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:15:31] PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:15:33] PROBLEM - puppet last run on cp1081 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:15:35] PROBLEM - puppet last run on cp5007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:15:41] PROBLEM - puppet last run on mw1225 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:15:47] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:15:51] PROBLEM - puppet last run on mw2273 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[nginx-reload] [17:15:57] PROBLEM - puppet last run on cp4026 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:03] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:21] PROBLEM - puppet last run on elastic2034 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:27] PROBLEM - puppet last run on mw1269 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:27] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:27] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:27] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:31] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:33] PROBLEM - puppet last run on elastic1037 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:35] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:35] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:35] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:37] PROBLEM - puppet last run on elastic2043 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:37] PROBLEM - puppet last run on mw1333 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:45] PROBLEM - puppet last run on mw2278 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:47] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) 05Stalled→03Open [17:16:55] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:16:55] PROBLEM - puppet last run on cp1079 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:07] PROBLEM - puppet last run on mw1313 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:07] PROBLEM - puppet last run on cp5004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:13] PROBLEM - puppet last run on mw2215 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:21] PROBLEM - puppet last run on mw2282 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:21] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:23] PROBLEM - puppet last run on mw2249 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:35] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:35] PROBLEM - puppet last run on mw2219 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:45] PROBLEM - puppet last run on elastic2031 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:45] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:55] PROBLEM - puppet last run on mw2269 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:55] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:55] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:57] PROBLEM - puppet last run on mw2254 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:17:57] PROBLEM - puppet last run on mw1315 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:01] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:05] PROBLEM - puppet last run on elastic2027 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:09] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:31] PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:37] PROBLEM - puppet last run on elastic2052 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:37] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:37] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:37] PROBLEM - puppet last run on cp4032 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:41] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:47] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:53] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:55] PROBLEM - puppet last run on elastic1044 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:18:55] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:01] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:19] PROBLEM - puppet last run on mw1344 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:19] PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:19] PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:19] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:21] PROBLEM - puppet last run on elastic2038 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:25] PROBLEM - puppet last run on mw2264 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:25] PROBLEM - puppet last run on mw2272 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:27] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:27] PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:33] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:33] PROBLEM - puppet last run on mw2159 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:33] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:43] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:43] PROBLEM - puppet last run on cp4024 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:19:49] (03PS1) 10CDanis: Increase nginx limits on http resp hdr block size [puppet] - 10https://gerrit.wikimedia.org/r/519675 [17:19:53] PROBLEM - puppet last run on elastic2025 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:20:15] PROBLEM - puppet last run on cp4029 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:20:17] PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:20:21] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:20:43] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:20:43] PROBLEM - puppet last run on mw1330 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:20:47] PROBLEM - puppet last run on mw2200 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:21:07] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:21:13] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[nginx-reload] [17:21:19] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:21:23] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:21:25] PROBLEM - puppet last run on elastic1049 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:22:30] 10Operations, 10Analytics, 10Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Pchelolo) [17:23:37] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] [17:24:31] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 4948 MB (3% inode=82%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:26:01] (03PS2) 10CDanis: Increase nginx limits on http resp hdr block size [puppet] - 10https://gerrit.wikimedia.org/r/519675 (https://phabricator.wikimedia.org/T226840) [17:26:53] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:27:25] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:27:49] PROBLEM - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: /srv 4673 MB (3% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:28:36] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop 
test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) [17:29:36] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) [17:31:19] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:31:36] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, 10User-Elukey: Create an-tool1006, a ganeti vm to be used as client for the Hadoop test cluster - https://phabricator.wikimedia.org/T226844 (10elukey) [17:31:49] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:32:07] (03PS3) 10CDanis: Increase nginx limits on http resp hdr block size [puppet] - 10https://gerrit.wikimedia.org/r/519675 (https://phabricator.wikimedia.org/T226840) [17:33:05] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:33:09] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:36:32] !log restarting eventstreams on scb1001 with trace logging of X-Client-IP for T226808 [17:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:37] T226808: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 [17:37:09] huh elukey did you know [17:37:09] https://config-master.wikimedia.org/pybal/eqiad/eventstreams [17:37:15] the scbs have different weights!? 
[17:37:26] (03PS4) 10CDanis: Increase nginx limits on http resp hdr block size [puppet] - 10https://gerrit.wikimedia.org/r/519675 (https://phabricator.wikimedia.org/T226840) [17:37:54] nope [17:38:25] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:38:29] RECOVERY - puppet last run on elastic2043 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:38:47] RECOVERY - puppet last run on cp1079 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:38:51] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:38:55] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:39:05] RECOVERY - puppet last run on mw1331 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:39:39] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:39:39] RECOVERY - puppet last run on mw2189 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:39:59] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:40:03] RECOVERY - puppet last run on ms-fe1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:40:23] RECOVERY - puppet last run on elastic1052 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:40:23] RECOVERY - puppet last run on mw2185 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:40:31] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:40:51] RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
[17:40:51] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:41:07] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:41:09] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:41:13] RECOVERY - puppet last run on elastic2038 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:41:17] RECOVERY - puppet last run on mw2275 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:41:19] RECOVERY - puppet last run on mw2155 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:41:19] RECOVERY - puppet last run on mw2178 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:41:21] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:41:23] elukey: do you know [17:41:27] is this still the proper procedure? 
[17:41:27] https://wikitech.wikimedia.org/wiki/LVS#Pool_or_depool_hosts_(for_non-Etcd_managed_pools) [17:42:05] (03CR) 10Esanders: [C: 03+1] Set some wikis to use the mobile-ve-as-default a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519667 (https://phabricator.wikimedia.org/T221196) (owner: 10DLynch) [17:42:09] RECOVERY - puppet last run on elastic2026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:42:09] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:42:13] RECOVERY - puppet last run on cp1078 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:42:19] RECOVERY - puppet last run on mw1321 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:42:19] RECOVERY - puppet last run on ms-fe1005 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [17:42:33] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:42:33] RECOVERY - puppet last run on mw1330 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:42:35] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:42:49] RECOVERY - puppet last run on mw2190 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:42:53] RECOVERY - puppet last run on cp1081 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:42:55] RECOVERY - puppet last run on cp5007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:42:57] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:42:59] RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:43:07] 
RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:43:09] RECOVERY - puppet last run on mw2273 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:43:17] RECOVERY - puppet last run on cp4026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:43:23] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:43:39] RECOVERY - puppet last run on elastic2034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:43:41] RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:43:47] RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:43:47] RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:43:47] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:43:47] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:43:51] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:43:53] RECOVERY - puppet last run on elastic1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:43:55] RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:43:55] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:43:57] RECOVERY - puppet last run on mw1333 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:44:02] ottomata: no no [17:44:05] RECOVERY - puppet last run on mw2278 is OK: OK: Puppet is 
currently enabled, last run 2 minutes ago with 0 failures [17:44:15] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:44:22] yeahh seemed wrong... [17:44:25] RECOVERY - puppet last run on mw1313 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:44:27] RECOVERY - puppet last run on cp5004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:44:33] RECOVERY - puppet last run on mw2215 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:44:39] ottomata: there is a confctl tool on puppet master [17:44:39] RECOVERY - puppet last run on mw2282 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:44:39] RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:44:41] RECOVERY - puppet last run on mw2249 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:44:53] RECOVERY - puppet last run on mw2219 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:44:53] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:45:01] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:45:01] RECOVERY - puppet last run on elastic2031 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:45:01] PROBLEM - puppet last run on cp1087 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 9 seconds ago with 3 failures. 
Failed resources (up to 3 shown): Service[nginx],Exec[nginx-reload] [17:45:11] RECOVERY - puppet last run on mw2269 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:45:11] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:45:11] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:45:13] RECOVERY - puppet last run on mw1315 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:45:13] RECOVERY - puppet last run on mw2254 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:45:17] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:45:21] RECOVERY - puppet last run on elastic2027 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:45:21] ok found more docs [17:45:27] haven't depooled an individual service in a while [17:45:30] ottomata: should be sudo -i confctl --quiet depool --hostname scb1001.eqiad.wmnet --service eventstreams [17:45:46] ok thank you [17:45:47] running that [17:45:47] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:45:55] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:45:55] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:45:55] RECOVERY - puppet last run on elastic2052 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:45:55] RECOVERY - puppet last run on cp4032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:45:57] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
[17:46:05] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:46:11] RECOVERY - puppet last run on elastic1044 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:46:11] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:46:11] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:46:17] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:46:37] RECOVERY - puppet last run on mw1344 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:46:37] RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:46:37] RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:46:37] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:46:43] RECOVERY - puppet last run on mw2264 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:46:43] RECOVERY - puppet last run on mw2272 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:46:45] RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:46:45] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:46:49] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:46:51] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:46:51] RECOVERY - puppet last run on mw2159 is OK: OK: Puppet is currently enabled, 
last run 5 minutes ago with 0 failures [17:47:01] RECOVERY - puppet last run on cp4024 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:47:01] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:47:11] RECOVERY - puppet last run on elastic2025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:47:35] RECOVERY - puppet last run on cp4029 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:47:35] RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:47:39] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:48:07] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:48:15] RECOVERY - puppet last run on mw1320 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:48:31] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:48:43] RECOVERY - puppet last run on elastic1049 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:49:12] (CR) CDanis: [C: +2] Increase nginx limits on http resp hdr block size [puppet] - https://gerrit.wikimedia.org/r/519675 (https://phabricator.wikimedia.org/T226840) (owner: CDanis) [17:50:51] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:53:19] !log increasing nginx proxy_buffer_size / proxy_buffers 02d7bcaa [17:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:55] RECOVERY - puppet last run on cp1087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:06:01] RECOVERY - Disk space on notebook1003 is OK: 
DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [18:06:51] (PS4) Bstorm: toolforge: start the configuration yaml for kubeadm [puppet] - https://gerrit.wikimedia.org/r/519527 (https://phabricator.wikimedia.org/T215531) [18:06:56] !log joal@deploy1001 Finished deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy (duration: 53m 35s) [18:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:26] !log joal@deploy1001 Started deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1003 only [18:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:30] !log joal@deploy1001 Finished deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1003 only (duration: 00m 04s) [18:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:04] !log joal@deploy1001 Started deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1003 only [18:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:09] !log joal@deploy1001 Finished deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1003 only (duration: 00m 05s) [18:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:43] !log joal@deploy1001 Started deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1003 only again [18:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:09] !log joal@deploy1001 Finished deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1003 only again (duration: 00m 26s) [18:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:18] !log systemctl reset-failed kafka* units on kafka2001 (in decom phase) [18:12:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:32] !log joal@deploy1001 Started deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1004 only [18:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:27] Operations, Wikimedia-Mailing-lists, User-greg: Reset PW for Code-Health mailing list - https://phabricator.wikimedia.org/T226842 (greg) Open→Resolved a:greg I have it, I can share with @Jrbranaa . [18:14:34] !log joal@deploy1001 Finished deploy [analytics/refinery@8d6fa30]: Late regular analytics weekly deploy - notebook1004 only (duration: 01m 03s) [18:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:00] (CR) Andrew Bogott: [C: +1] "Thank you for doing this! I can't think of any reason why /adding/ things to this repo would do any harm, so the only real bits of intere" [labs/private] - https://gerrit.wikimedia.org/r/519051 (owner: Jbond) [18:23:49] (CR) Andrew Bogott: [C: +1] "(I'm happy to merge and babysit if you'd like me to, just lmk)" [labs/private] - https://gerrit.wikimedia.org/r/519051 (owner: Jbond) [18:25:20] (CR) Alex Monk: "+1 to the concept" [labs/private] - https://gerrit.wikimedia.org/r/519051 (owner: Jbond) [18:27:22] (PS4) Jbond: missing data: audited using audit_hiera.py script [labs/private] - https://gerrit.wikimedia.org/r/519051 [18:33:25] !log joal@deploy1001 Started deploy [analytics/refinery@de8eb99]: Missing bit of regular analytics deploy [18:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:35] Operations, Analytics, Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (Ottomata) Collected some info about which IPs were connecting on scb1001. Over a period of about 40 minutes: 3 "100.26.... 
[18:42:03] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:42:07] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:44:39] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:45:09] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:45:15] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [18:45:50] (PS5) Andrew Bogott: missing data: audited using audit_hiera.py script [labs/private] - https://gerrit.wikimedia.org/r/519051 (owner: Jbond) [18:46:00] (CR) Andrew Bogott: [V: +2 C: +2] missing data: audited using audit_hiera.py script [labs/private] - https://gerrit.wikimedia.org/r/519051 (owner: Jbond) [18:46:23] (PS3) Andrew Bogott: audit_hiera: This is a small script to audit the private repo [labs/private] - https://gerrit.wikimedia.org/r/519050 (https://phabricator.wikimedia.org/T226530) (owner: Jbond) [18:46:34] (CR) Andrew Bogott: [V: +2 C: +2] audit_hiera: This is a small script to audit the private repo [labs/private] - https://gerrit.wikimedia.org/r/519050 (https://phabricator.wikimedia.org/T226530) (owner: Jbond) [18:47:16] (PS6) Andrew Bogott: missing data: audited using audit_hiera.py script [labs/private] - https://gerrit.wikimedia.org/r/519051 (owner: Jbond) [18:47:19] (CR) Andrew Bogott: [V: +2 C: +2] 
missing data: audited using audit_hiera.py script [labs/private] - https://gerrit.wikimedia.org/r/519051 (owner: Jbond) [18:48:05] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [18:50:31] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:51:01] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:51:13] !log joal@deploy1001 Finished deploy [analytics/refinery@de8eb99]: Missing bit of regular analytics deploy (duration: 17m 47s) [18:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:17] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:52:21] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:03:12] !log joal@deploy1001 Started deploy [analytics/refinery@de8eb99]: Missing bit of regular analytics deploy [19:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:20] !log joal@deploy1001 Finished deploy [analytics/refinery@de8eb99]: Missing bit of regular analytics deploy (duration: 02m 08s) [19:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:14] (CR) Jforrester: "Note to deployer: mobile.php is CommonSettings.php-like, so the IS change has to be deployed first or you'll break the world." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/519667 (https://phabricator.wikimedia.org/T221196) (owner: DLynch) [19:18:40] (PS1) Mforns: analytics::refinery::job::data_purge Migrate webrequest timers to new deletion script [puppet] - https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) [19:19:02] (PS2) Mforns: analytics::refinery::job::data_purge Migrate webrequest timers to new deletion script [puppet] - https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) [19:19:59] (CR) jerkins-bot: [V: -1] analytics::refinery::job::data_purge Migrate webrequest timers to new deletion script [puppet] - https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) (owner: Mforns) [19:20:58] (CR) Mforns: [C: -1] "I tested this thoroughly, plus the checksums ensure the script execution corresponds to the test I did. But, we can wait until I'm back to" [puppet] - https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) (owner: Mforns) [19:23:26] (PS3) Mforns: analytics::refinery::job::data_purge Migrate webrequest timers to new script [puppet] - https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) [19:28:02] (CR) Mforns: [C: -1] "I tested this thoroughly, plus the checksums ensure the script execution corresponds to the test I did. But, we can wait until I'm back to" [puppet] - https://gerrit.wikimedia.org/r/519683 (https://phabricator.wikimedia.org/T226862) (owner: Mforns) [19:31:25] (PS1) Mforns: analytics::refinery::job::data_purge Migrate EL timers to new script [puppet] - https://gerrit.wikimedia.org/r/519685 (https://phabricator.wikimedia.org/T226862) [19:33:12] (CR) Mforns: [C: -1] "I tested this thoroughly, plus the checksums ensure the script execution corresponds to the test I did. 
But, we can wait until I'm back to" [puppet] - https://gerrit.wikimedia.org/r/519685 (https://phabricator.wikimedia.org/T226862) (owner: Mforns) [19:36:22] (PS1) Mforns: analytics::refinery::job::data_purge Remove timer for WDQS extract [puppet] - https://gerrit.wikimedia.org/r/519688 (https://phabricator.wikimedia.org/T226862) [19:40:05] Friday deploy ahoy. [19:40:09] (PS1) Mforns: analytics::refinery::job::data_purge Migrate mediawiki timers to new script [puppet] - https://gerrit.wikimedia.org/r/519690 (https://phabricator.wikimedia.org/T226862) [19:41:04] (CR) Jforrester: [C: +2] Set some wikis to use the mobile-ve-as-default a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/519667 (https://phabricator.wikimedia.org/T221196) (owner: DLynch) [19:41:06] (CR) Mforns: [C: -1] "I tested this thoroughly, plus the checksums ensure the script execution corresponds to the test I did. But, we can wait until I'm back to" [puppet] - https://gerrit.wikimedia.org/r/519690 (https://phabricator.wikimedia.org/T226862) (owner: Mforns) [19:42:04] (Merged) jenkins-bot: Set some wikis to use the mobile-ve-as-default a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/519667 (https://phabricator.wikimedia.org/T221196) (owner: DLynch) [19:42:19] (CR) jenkins-bot: Set some wikis to use the mobile-ve-as-default a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/519667 (https://phabricator.wikimedia.org/T221196) (owner: DLynch) [19:45:06] (PS1) Mforns: analytics::refinery::job::data_purge Migrate banner timers to new script [puppet] - https://gerrit.wikimedia.org/r/519691 (https://phabricator.wikimedia.org/T226862) [19:46:56] (CR) Mforns: [C: -1] "I tested this thoroughly, plus the checksums ensure the script execution corresponds to the test I did. 
But, we can wait until I'm back to" [puppet] - https://gerrit.wikimedia.org/r/519691 (https://phabricator.wikimedia.org/T226862) (owner: Mforns) [19:48:15] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T221196 VE mobile A/B test part 1 (duration: 00m 50s) [19:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:22] T221196: Set VE as default for target wikis in A/B test - https://phabricator.wikimedia.org/T221196 [19:48:42] (PS1) Mforns: analytics::refinery::job::data_purge Migrate geoeditors timers to new script [puppet] - https://gerrit.wikimedia.org/r/519693 (https://phabricator.wikimedia.org/T226862) [19:49:26] !log jforrester@deploy1001 Synchronized wmf-config/mobile.php: T221196 VE mobile A/B test part 2 (duration: 00m 49s) [19:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:50] (CR) Mforns: [C: -1] "I tested this thoroughly, plus the checksums ensure the script execution corresponds to the test I did. But, we can wait until I'm back to" [puppet] - https://gerrit.wikimedia.org/r/519693 (https://phabricator.wikimedia.org/T226862) (owner: Mforns) [19:50:40] (CR) Mforns: "WDQS extract does no longer exist." [puppet] - https://gerrit.wikimedia.org/r/519688 (https://phabricator.wikimedia.org/T226862) (owner: Mforns) [19:59:19] (PS1) Bstorm: toolforge: the kubeadm repo can't be labeled trusted in puppet apparently [puppet] - https://gerrit.wikimedia.org/r/519696 (https://phabricator.wikimedia.org/T215531) [19:59:41] Lucas_WMDE: Welcome. ;-) [19:59:55] * James_F is having fun, waiting for CI. [20:00:02] thanks ^^ [20:00:09] * Lucas_WMDE peeks at the log [20:00:21] looks like the backport isn’t your only friday deployment today [20:01:00] Yeah, given I was deploying anyway I helped the Editing team out with their forgotten config patch. [20:06:32] Looks good, was able to move on TestCommons. 
[20:07:22] Wikibase editing still seems to work there as well [20:07:30] OK, deploying. [20:08:07] And no surprise move tabs on Q pages. [20:08:24] good point [20:08:46] Lucas_WMDE: Do you want me to update on Commons village pump? [20:08:53] You should go have your Friday evening. :-) [20:09:02] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/Wikibase/repo/RepoHooks.php: Make it possible for File pages to be moved on Commons again T224303 T226672 (duration: 00m 50s) [20:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:09] T224303: Wikibase Repo prevents page moves in NS 0 on Commons - https://phabricator.wikimedia.org/T224303 [20:09:09] T226672: File move/ rename tab not working in Commons - https://phabricator.wikimedia.org/T226672 [20:09:13] please update, yes :) [20:09:23] but my Friday evening is fine tyvm :D [20:09:30] * James_F grins. [20:10:45] PROBLEM - HHVM rendering on mw1327 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:12:15] RECOVERY - HHVM rendering on mw1327 is OK: HTTP OK: HTTP/1.1 200 OK - 81581 bytes in 0.439 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:12:34] fatalmonitor looks fine [20:12:40] Yup. 
[20:22:06] (PS2) Jeena Huneidi: Give scaffold template configuration options for dev purposes [deployment-charts] - https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) [20:23:13] (CR) Bstorm: [C: +2] toolforge: the kubeadm repo can't be labeled trusted in puppet apparently [puppet] - https://gerrit.wikimedia.org/r/519696 (https://phabricator.wikimedia.org/T215531) (owner: Bstorm) [20:25:34] (PS3) Jeena Huneidi: Give scaffold template configuration options for dev purposes [deployment-charts] - https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) [20:27:21] (CR) Gergő Tisza: Add rate limiter to Special:ConfirmEmail - config change (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: SBassett) [20:49:06] (PS4) Jeena Huneidi: Give scaffold template configuration options for dev purposes [deployment-charts] - https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) [20:49:35] (CR) Jeena Huneidi: Give scaffold template configuration options for dev purposes (9 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: Jeena Huneidi) [20:51:29] (CR) Urbanecm: [C: -1] "Just noticed wikimania2018wiki is in /dblists/commonsuploads.dblist. 
I feel there can be similar problems as there were with closing priva" [mediawiki-config] - https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) (owner: MarcoAurelio) [20:58:55] (PS1) Jhedden: icinga: fix tools checker stretch jobs [puppet] - https://gerrit.wikimedia.org/r/519718 (https://phabricator.wikimedia.org/T213413) [20:59:03] Operations, Analytics, Patch-For-Review, Security, Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (Ottomata) [21:03:05] looks like everything’s fine after the deploy so I’m logging off now, have a nice weekend :) [21:16:47] !log otto@deploy1001 Started deploy [eventstreams/deploy@2af2719]: Manually blacklisting IP - T226808 [21:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:53] T226808: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 [21:19:54] !log otto@deploy1001 Finished deploy [eventstreams/deploy@2af2719]: Manually blacklisting IP - T226808 (duration: 03m 07s) [21:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:54] Operations, Analytics, Patch-For-Review, Security, Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (Ottomata) To hold us over on the weekend, I've manually blacklisted the offending IP in Eve... 
[21:32:21] (CR) Bstorm: [C: +1] icinga: fix tools checker stretch jobs [puppet] - https://gerrit.wikimedia.org/r/519718 (https://phabricator.wikimedia.org/T213413) (owner: Jhedden) [21:33:30] (CR) Jhedden: [C: +2] icinga: fix tools checker stretch jobs [puppet] - https://gerrit.wikimedia.org/r/519718 (https://phabricator.wikimedia.org/T213413) (owner: Jhedden) [21:53:36] (PS1) Bstorm: aptrepo: fix the kubeadm packages to include containerd.io [puppet] - https://gerrit.wikimedia.org/r/519726 (https://phabricator.wikimedia.org/T215975) [22:05:50] (PS4) Jeena Huneidi: Add restbase chart (port from local-charts) [deployment-charts] - https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T224935) [22:06:29] PROBLEM - puppet last run on mw2252 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [22:07:13] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:07:59] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:26:32] (PS2) MarcoAurelio: Close wikimania2018.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) [22:27:03] (PS3) MarcoAurelio: Close wikimania2018.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) [22:27:31] (CR) MarcoAurelio: "> Can you please remove wikimania2018wiki from" [mediawiki-config] - https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) (owner: MarcoAurelio) [22:31:48] (CR) 
Urbanecm: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/518715 (https://phabricator.wikimedia.org/T201188) (owner: MarcoAurelio) [22:33:43] RECOVERY - puppet last run on mw2252 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:48:09] Operations, cloud-services-team (Kanban): netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994 (faidon) Stalled→Declined I think there's a bit of a confusion. AIUI, nftables can refer to two different things: 1. The nf_tables kernel subsystem 1. The nftables... [22:55:01] (CR) Legoktm: [C: +1] "I used furl to bypass varnish/apache on P8685 and I see Accept-Language being added in the Vary header." [puppet] - https://gerrit.wikimedia.org/r/519606 (https://phabricator.wikimedia.org/T203179) (owner: Ema) [23:03:09] Operations, Analytics, Patch-For-Review, Security, Services (watching): Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (Nuria) {F29666420} Well, blocking that one IP had the effect of lowering connections. Give...