[00:18:12] (PS3) CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [00:18:22] (PS1) Cwhite: smart: simplify PD [puppet] - https://gerrit.wikimedia.org/r/588515 (https://phabricator.wikimedia.org/T199236) [00:18:31] (CR) jerkins-bot: [V: -1] customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov) [00:19:40] (PS4) CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [00:23:23] Operations, Security-Team, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (Reedy) [00:31:41] (PS6) CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [00:31:49] (CR) CRusnov: "> Patch Set 2:" (11 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov) [00:51:10] (PS7) CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [00:51:29] (CR) jerkins-bot: [V: -1] customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov) [00:52:32] (CR) CRusnov: "> Patch Set 5:" [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov) [01:05:37] 
(PS8) CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [01:06:41] (CR) CRusnov: "> Patch Set 7:" [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov) [02:45:26] (PS1) Andrew Bogott: mariadb ferm: fix a typo in the designate_hosts lookup [puppet] - https://gerrit.wikimedia.org/r/588519 (https://phabricator.wikimedia.org/T249941) [02:50:06] (CR) Andrew Bogott: [C: +2] mariadb ferm: fix a typo in the designate_hosts lookup [puppet] - https://gerrit.wikimedia.org/r/588519 (https://phabricator.wikimedia.org/T249941) (owner: Andrew Bogott) [04:02:34] PROBLEM - snapshot of s7 in eqiad on db1115 is CRITICAL: snapshot for s7 at eqiad taken more than 3 days ago: Most recent backup 2020-04-11 03:53:48 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [04:07:18] (CR) Abijeet Patro: "recheck" [dumps/dcat] - https://gerrit.wikimedia.org/r/587743 (owner: L10n-bot) [04:52:21] (CR) Marostegui: [C: +2] clouddb.sql.erb: Add GRANTs file [puppet] - https://gerrit.wikimedia.org/r/588202 (owner: Marostegui) [04:56:29] (PS1) Marostegui: Revert "db-codfw.php: Depool pc2008" [mediawiki-config] - https://gerrit.wikimedia.org/r/588538 [04:56:34] (PS2) Marostegui: Revert "db-codfw.php: Depool pc2008" [mediawiki-config] - https://gerrit.wikimedia.org/r/588538 [04:58:08] (CR) Marostegui: [C: +2] Revert "db-codfw.php: Depool pc2008" [mediawiki-config] - https://gerrit.wikimedia.org/r/588538 (owner: Marostegui) [04:59:13] (Merged) jenkins-bot: Revert "db-codfw.php: Depool pc2008" [mediawiki-config] - https://gerrit.wikimedia.org/r/588538 (owner: Marostegui) [05:01:25] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool pc2008 after upgrade (duration: 01m 00s) [05:01:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:24] (PS1) Muehlenhoff: Extend MOU for shiladsen [puppet] - https://gerrit.wikimedia.org/r/588539 [05:12:49] (CR) Muehlenhoff: [C: +2] Extend MOU for shiladsen [puppet] - https://gerrit.wikimedia.org/r/588539 (owner: Muehlenhoff) [05:25:55] Operations, DC-Ops: Wipe of spare/replacement disks - https://phabricator.wikimedia.org/T166368 (MoritzMuehlenhoff) >>! In T166368#6052598, @faidon wrote: > If I understand it correctly, this task is specifically about a box that was returned to the spare pool and then was reallocated for a new purpose b... [05:27:32] Operations, Traffic, Patch-For-Review: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (Vgutierrez) Since yesterday, memory increased up to 10Gb in cp1085, current snapshot: ` Allocated | In-Use | Type Size | Free List Name --------------------|------... [05:29:51] Operations, serviceops, wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (MoritzMuehlenhoff) >>! In T237889#6052809, @Andrew wrote: > Pinging @MoritzMuehlenhoff, any objections to this? Fine with me. php-ldap is built from the core PHP package... 
[05:30:31] !log rolling upgrade to ats 8.0.7-rc0-1wm1 in esams and eqiad [05:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:05] (PS1) KartikMistry: Update cxserver to 2020-04-13-094138-production [deployment-charts] - https://gerrit.wikimedia.org/r/588540 (https://phabricator.wikimedia.org/T239459) [05:42:52] (CR) Muehlenhoff: [C: +1] "Looks good, one comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/588419 (https://phabricator.wikimedia.org/T244792) (owner: Jbond) [05:47:45] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/587529 (https://phabricator.wikimedia.org/T249724) (owner: Alexandros Kosiaris) [06:05:49] (PS2) Elukey: admin: allow gpu-users to use radeontop [puppet] - https://gerrit.wikimedia.org/r/587726 [06:11:45] (CR) Elukey: [C: +2] admin: allow gpu-users to use radeontop [puppet] - https://gerrit.wikimedia.org/r/587726 (owner: Elukey) [06:14:09] (PS4) Vgutierrez: Release 8.0.7-rc0-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) [06:31:34] Operations, serviceops, wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (Joe) >>! In T237889#6053314, @bd808 wrote: >>>! In T237889#6053313, @Joe wrote: >> Also: how temporary? Do you have a tentative timeline for transitioning wikitech to SUL... [06:36:27] Operations, Keyholder, Release-Engineering-Team-TODO, Patch-For-Review, Release-Engineering-Team (Deployment services): Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (hashar) > There are a bunch of pending changes in Gerrit for about a year, plus more that I'v... 
[06:40:24] Operations, observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (elukey) p:Triage→High [06:43:25] Operations, ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (MoritzMuehlenhoff) p:Triage→Medium [06:50:55] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/588420 (https://phabricator.wikimedia.org/T244792) (owner: Jbond) [06:51:48] Operations, ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (MoritzMuehlenhoff) p:Triage→Medium [06:55:34] Operations, observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (elukey) [07:01:33] (PS2) Ema: vcl: toggle to block non-API traffic from public clouds [puppet] - https://gerrit.wikimedia.org/r/588135 [07:08:06] (PS3) Ema: vcl: toggle to block non-API traffic from public clouds [puppet] - https://gerrit.wikimedia.org/r/588135 [07:08:34] (PS5) Vgutierrez: Release 8.0.7-rc0-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) [07:11:27] (CR) Ema: "vcl noop: https://puppet-compiler.wmflabs.org/compiler1002/21892/" [puppet] - https://gerrit.wikimedia.org/r/588135 (owner: Ema) [07:14:21] Operations, observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (elukey) Is there any for of index cleanup? It seems that we are indexing since January, maybe we could clean up some old indexes to regain space? [07:21:11] (CR) Dzahn: "So the intention here is: internet -> ATS -> varnish (3120) -> envoy (444) -> nodejs (22280). 
The last step is happening in https://gerri" [puppet] - https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn) [07:21:34] (CR) Dzahn: "So the intention here is: internet -> ATS -> varnish (3120) -> envoy (444) -> nodejs (22280). The first steps are happening in https://ge" [puppet] - https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn) [07:24:50] (CR) Dzahn: ATS/phabricator: directly talk wss:// to aphlict (1 comment) [puppet] - https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn) [07:26:43] Operations: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Marostegui) [07:26:50] kormat: ^ [07:28:03] (CR) Elukey: [C: +2] Set Java CMS GC for cloudelastic chi cluster [puppet] - https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517) (owner: Elukey) [07:33:36] !log apply CMS GC settings to chi on cloudelastic1001 - T231517 [07:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:43] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517 [07:37:10] (PS1) Dzahn: add people2001.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/588647 (https://phabricator.wikimedia.org/T247649) [07:38:39] (CR) Dzahn: [C: +2] add people2001.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/588647 (https://phabricator.wikimedia.org/T247649) (owner: Dzahn) [07:38:50] (PS2) Dzahn: add people2001.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/588647 (https://phabricator.wikimedia.org/T247649) [07:42:37] (CR) Vgutierrez: ATS/phabricator: directly talk wss:// to aphlict (1 comment) [puppet] - https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn) [07:47:31] (CR) Dzahn: "> Patch Set 10:" (1 comment) [puppet] - 
https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn) [07:55:40] (PS1) JMeybohm: admin: Add basic dotfiles for jayme [puppet] - https://gerrit.wikimedia.org/r/588649 [07:59:03] (PS1) Ema: vcl: remove n-hit-wonder admission policy [puppet] - https://gerrit.wikimedia.org/r/588650 (https://phabricator.wikimedia.org/T249809) [07:59:35] Hello, i can provision wikimedia puppet in AWS??? [07:59:57] (PS6) Vgutierrez: Release 8.0.7-rc0-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) [08:01:36] (CR) Joaquin: "Merge it now." [debs/trafficserver] - https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez) [08:01:44] (CR) Ema: "VCL looks good: https://puppet-compiler.wmflabs.org/compiler1002/21893/" [puppet] - https://gerrit.wikimedia.org/r/588650 (https://phabricator.wikimedia.org/T249809) (owner: Ema) [08:03:26] !log [08:04:37] !log joaquin: Upgrading to ATS 8.0.7, stage at 0% completed. [08:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:10] !log Upgrading to ATS 8.0.7, stage at 50% completed. [08:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:33] <_joe_> sonic2020: please stop [08:05:48] (CR) jerkins-bot: [V: -1] Release 8.0.7-rc0-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez) [08:05:51] _joe_ OK. I can provision Wikimedia Puppet in AWS??? [08:06:10] jynus You cannot ban me. [08:06:23] lol [08:06:26] <_joe_> jynus: ahem [08:06:53] I have a oportunnity. [08:07:20] _joe_ I can provision Wikimedia Puppet in AWS??? 
[08:07:40] <_joe_> sonic2020: i guess you can use our modules, yes, but our infra is not designed to run on AWS [08:09:23] Operations: Homr: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 (ayounsi) p:Triage→Medium [08:10:53] _joe_ I can provision my own wiki farm with your own modules with Puppet in AWS??? [08:11:26] (PS7) Vgutierrez: Release 8.0.7-rc0-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) [08:13:56] _joe_ I can provision my own wiki farm with your own modules with Puppet in AWS??? [08:14:43] <_joe_> I already answered, now I'm busy, sorry. [08:15:32] _joe_ I can select a module for Apache Traffic Server??? [08:16:47] I'm not sure, but I guess ATS or varnish aren't purged caches when page are edited [08:17:05] is already filed? [08:17:34] rxy: see https://phabricator.wikimedia.org/T249325 [08:17:50] k, thx :D [08:18:27] rxy: in theory it should eventually happen, with a delay, but people are already working on fixing the delay [08:19:13] My access region is eqsin [08:19:56] rxy I can provision in AWS a Puppet code for Wikimedia. [08:20:59] Swant: ^ [08:21:04] (PS2) Giuseppe Lavagetto: tlsproxy: add the ability to define an idle timeout for upstream connections [puppet] - https://gerrit.wikimedia.org/r/587733 [08:21:48] RhinosF1 I can provision in AWS a Puppet code for Wikimedia??? [08:22:40] paravoid: wmf banned, disabled phab, wikitech showing as not existing [08:22:51] thanks [08:25:41] (PS1) Kormat: admin: Add user kormat [puppet] - https://gerrit.wikimedia.org/r/588651 (https://phabricator.wikimedia.org/T250134) [08:25:54] PROBLEM - snapshot of s3 in eqiad on db1115 is CRITICAL: snapshot for s3 at eqiad taken more than 3 days ago: Most recent backup 2020-04-11 07:52:31 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:25:54] \o/ [08:26:38] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/588651 (https://phabricator.wikimedia.org/T250134) (owner: Kormat) [08:26:59] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (fgiunchedi) [08:27:52] checking backups [08:33:39] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (fgiunchedi) p:Triage→Medium I'm processing this access request as part of SRE clinic duty, however I'm still unable to confirm whether there is an NDA on... [08:35:00] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Marostegui) @kormat has signed https://phabricator.wikimedia.org/L3 at Tue, Apr 14, 10:33 [08:36:45] (CR) Marostegui: admin: Add user kormat (1 comment) [puppet] - https://gerrit.wikimedia.org/r/588651 (https://phabricator.wikimedia.org/T250134) (owner: Kormat) [08:38:50] (CR) Kormat: admin: Add user kormat (1 comment) [puppet] - https://gerrit.wikimedia.org/r/588651 (https://phabricator.wikimedia.org/T250134) (owner: Kormat) [08:41:20] (PS2) Kormat: admin: Add user kormat [puppet] - https://gerrit.wikimedia.org/r/588651 (https://phabricator.wikimedia.org/T250134) [08:44:30] PROBLEM - snapshot of x1 in eqiad on db1115 is CRITICAL: snapshot for x1 at eqiad taken more than 3 days ago: Most recent backup 2020-04-11 08:24:36 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:44:54] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:45:23] Operations, LDAP-Access-Requests, SRE-Access-Requests: Onboarding Wolfgang Kandek - 
https://phabricator.wikimedia.org/T249352 (fgiunchedi) [08:45:25] Operations, SRE-Access-Requests: Requesting access to wdqs-admins for Addshore - https://phabricator.wikimedia.org/T250137 (Addshore) [08:45:48] Operations, SRE-Access-Requests: Requesting access to wdqs-admins for Addshore - https://phabricator.wikimedia.org/T250137 (Addshore) [08:46:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:46:53] (PS3) Kormat: admin: Add user kormat [puppet] - https://gerrit.wikimedia.org/r/588651 (https://phabricator.wikimedia.org/T250134) [08:49:03] Operations, SRE-Access-Requests, User-Addshore: Requesting access to wdqs-admins for Addshore - https://phabricator.wikimedia.org/T250137 (Addshore) [08:49:06] (CR) Marostegui: [C: +1] "I have confirmed that 24118 is the uid, I have also verified that Stephen over videocall" [puppet] - https://gerrit.wikimedia.org/r/588651 (https://phabricator.wikimedia.org/T250134) (owner: Kormat) [08:49:30] !log restart elastic-chi on cloudelastic1001 with -XX:NewSize=10G - T231517 [08:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:38] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517 [08:49:58] (PS1) Filippo Giunchedi: admin: add andrew-wmde to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/588653 (https://phabricator.wikimedia.org/T249733) [08:50:03] (CR) Marostegui: [C: +2] admin: Add user kormat [puppet] - https://gerrit.wikimedia.org/r/588651 (https://phabricator.wikimedia.org/T250134) (owner: Kormat) [08:50:08] Operations, MediaWiki-Cache, Page Content Service, Product-Infrastructure-Team-Backlog, and 4 others: cache_text cluster consistently backlogged on purge 
requests - https://phabricator.wikimedia.org/T249325 (Urbanecm) I think this should have it's priority increased to UBN. I'm receiving many rep... [08:51:49] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for andrew-wmde - https://phabricator.wikimedia.org/T249733 (fgiunchedi) @andrew-wmde @Tobi_WMDE_SW please see above, we need signoff before proceeding [08:51:53] (CR) Filippo Giunchedi: [C: -1] "Pending manager signoff on task" [puppet] - https://gerrit.wikimedia.org/r/588653 (https://phabricator.wikimedia.org/T249733) (owner: Filippo Giunchedi) [08:54:20] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for andrew-wmde - https://phabricator.wikimedia.org/T249733 (fgiunchedi) Also please specify whether Kerberos user is needed (cfr https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide) [08:56:11] Operations, SRE-Access-Requests, User-Addshore: Requesting access to wdqs-admins for Addshore - https://phabricator.wikimedia.org/T250137 (fgiunchedi) Sounds good to me, we need signoff from wqds folks (@Gehel perhaps?) [08:56:23] Operations, SRE-Access-Requests, User-Addshore: Requesting access to wdqs-admins for Addshore - https://phabricator.wikimedia.org/T250137 (fgiunchedi) p:Triage→Medium [08:58:52] Operations, SRE-Access-Requests, User-Addshore: Requesting access to wdqs-admins for Addshore - https://phabricator.wikimedia.org/T250137 (Gehel) Yep, `wdqs-admins` seems like the right group. As the engineering manager from Search Platform team, I completely support @Addshore to be part of the `wdq... [08:59:18] godog: I'm preparing the patch for ^ [08:59:41] gehel: sweet, thank you! [08:59:49] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/21897/ this change is a noop." 
[puppet] - https://gerrit.wikimedia.org/r/587733 (owner: Giuseppe Lavagetto) [09:00:09] (CR) Giuseppe Lavagetto: [C: +2] tlsproxy: add the ability to define an idle timeout for upstream connections [puppet] - https://gerrit.wikimedia.org/r/587733 (owner: Giuseppe Lavagetto) [09:02:32] (PS1) Gehel: admin: add user addshore to the wdqs-admins group [puppet] - https://gerrit.wikimedia.org/r/588654 (https://phabricator.wikimedia.org/T250137) [09:03:15] Operations, Anti-Harassment, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059 (fgiunchedi) Open→Stalled Stalling the task for now, as it seems a db test host access i... [09:05:41] (PS1) Ema: cache: ensure vhtcpd is stopped if using purged [puppet] - https://gerrit.wikimedia.org/r/588655 (https://phabricator.wikimedia.org/T249583) [09:08:15] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (fgiunchedi) Also cc @KFrancis for NDA confirmation, thanks! 
[09:08:40] !log Add kormat to ops and wmf ldap groups - T250134 [09:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:48] T250134: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 [09:11:02] Operations: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Marostegui) ssh to bast3004 confirmed by Stephen [09:11:14] Operations: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Kormat) [09:11:36] (CR) Jbond: [C: +2] "looks good will merge, FYI you can self +2 when updating your own dot files" [puppet] - https://gerrit.wikimedia.org/r/588649 (owner: JMeybohm) [09:13:35] !log add mwilliams to 'wmf' ldap group - T249844 [09:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:40] T249844: LDAP access to the wmf group for Matthew Williams - https://phabricator.wikimedia.org/T249844 [09:14:28] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Matthew Williams - https://phabricator.wikimedia.org/T249844 (fgiunchedi) Open→Resolved a:fgiunchedi @mwilliams access to turnilo/superset should work now, please reopen if needed! 
[09:16:20] (CR) Jbond: [C: +2] "LGTM will merge" [puppet] - https://gerrit.wikimedia.org/r/588654 (https://phabricator.wikimedia.org/T250137) (owner: Gehel) [09:16:59] !log add missing `routing-options rib inet6.0 aggregate defaults discard` where missing (cr3-knams, cr3-esams, cr2-eqord, cr2-eqdfw, cr1/2-eqiad/codfw) [09:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:02] (CR) Jbond: "updated thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/588419 (https://phabricator.wikimedia.org/T244792) (owner: Jbond) [09:19:13] (PS5) Jbond: profile::mail::mx: add type enforcment, lookups and move defaults [puppet] - https://gerrit.wikimedia.org/r/588419 (https://phabricator.wikimedia.org/T244792) [09:21:36] (PS1) Kormat: admin: upgrade kormat to root shell user (ops) [puppet] - https://gerrit.wikimedia.org/r/588657 (https://phabricator.wikimedia.org/T250134) [09:21:51] (PS6) Jbond: profile::mail::mx: add type enforcment, lookups and move defaults [puppet] - https://gerrit.wikimedia.org/r/588419 (https://phabricator.wikimedia.org/T244792) [09:22:30] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/588419 (https://phabricator.wikimedia.org/T244792) (owner: Jbond) [09:24:34] (PS5) Jbond: profile::mail::mx: Add toggle to enable jumpcloud integration [puppet] - https://gerrit.wikimedia.org/r/588420 (https://phabricator.wikimedia.org/T244792) [09:24:46] (CR) Marostegui: [C: +1] admin: upgrade kormat to root shell user (ops) [puppet] - https://gerrit.wikimedia.org/r/588657 (https://phabricator.wikimedia.org/T250134) (owner: Kormat) [09:27:14] (PS6) Jbond: profile::mail::mx: Add toggle to enable jumpcloud integration [puppet] - https://gerrit.wikimedia.org/r/588420 (https://phabricator.wikimedia.org/T244792) [09:27:22] Operations, netops: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 (faidon) 
[09:27:34] (CR) Kormat: [C: +2] admin: upgrade kormat to root shell user (ops) [puppet] - https://gerrit.wikimedia.org/r/588657 (https://phabricator.wikimedia.org/T250134) (owner: Kormat) [09:27:58] (PS2) Ema: cache: ensure vhtcpd is stopped if using purged and vice versa [puppet] - https://gerrit.wikimedia.org/r/588655 (https://phabricator.wikimedia.org/T249583) [09:29:04] (CR) Vgutierrez: [C: +1] cache: ensure vhtcpd is stopped if using purged and vice versa [puppet] - https://gerrit.wikimedia.org/r/588655 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [09:29:11] Operations, SRE-Access-Requests, Patch-For-Review, User-Addshore: Requesting access to wdqs-admins for Addshore - https://phabricator.wikimedia.org/T250137 (fgiunchedi) Open→Resolved a:fgiunchedi Done, should be fully propagated in 30 min. Thanks @jbond @Gehel ! [09:31:06] Operations, Services, Service-deployment-requests: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (fgiunchedi) p:Triage→Medium [09:31:24] Operations, WM-Bot: wm-bot doesn't set charset=utf-8, which breaks (amongst other things) emoji rendering - https://phabricator.wikimedia.org/T250104 (fgiunchedi) p:Triage→Medium [09:31:35] Operations, Mail, Wikimedia-Mailing-lists: Duplicate "moderator request(s) waiting" emails sent to list admins - https://phabricator.wikimedia.org/T250032 (fgiunchedi) p:Triage→Medium [09:31:47] Operations, Icinga, observability: Aggregate Proton, Restbase and mobileapps icinga alerts - https://phabricator.wikimedia.org/T250017 (fgiunchedi) p:Triage→Medium [09:32:00] Operations, vm-requests: eqiad: 1 VM request for people.wikimedia.org - https://phabricator.wikimedia.org/T249907 (fgiunchedi) p:Triage→Medium [09:33:30] Operations, SRE-swift-storage: Sanity check global-multiwrite logs for ConfirmEdit usage - https://phabricator.wikimedia.org/T159830 (fgiunchedi) 
Open→Resolved p:Triage→Medium a:fgiunchedi Nothing actionable on the swift's end AFAICT. I'll resolve the task but @Reedy please do reopen... [09:34:12] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Kormat) Access & root: kormat@puppetmaster1001:~$ sudo -i root@puppetmaster1001:~# kormat@cumin1001:~$ sudo -i root@cumin1001:~# [09:34:24] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Kormat) [09:36:20] (CR) Ema: [C: +2] cache: ensure vhtcpd is stopped if using purged and vice versa [puppet] - https://gerrit.wikimedia.org/r/588655 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [09:37:14] !log cleanup 2620:0:860::/46 and 208.80.152.0/22 aggregates from cr1/2-codfw - T246721 [09:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:21] T246721: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 [09:38:05] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Marostegui) [09:38:53] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Dzahn) - Invited to "maint-announce" [[ https://groups.google.com/a/wikimedia.org/forum/#!forum/ops-maintenance | shared inbox ]] on Google. - added to "Ops vendor maintenance" [[ https://office.wikimed... 
[09:39:35] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Marostegui) [09:39:38] RECOVERY - snapshot of x1 in eqiad on db1115 is OK: snapshot for x1 at eqiad taken less than 3 days ago and larger than 90 GB: Last one 2020-04-14 09:10:42 from db1140.eqiad.wmnet:3320 (164 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:39:55] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Dzahn) [09:40:49] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Kormat) [09:41:49] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Marostegui) @dzahn thank you! <3 [09:42:15] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Marostegui) [09:43:59] Operations, vm-requests: eqiad: 1 VM request for people.wikimedia.org - https://phabricator.wikimedia.org/T249907 (Dzahn) a:Dzahn [09:47:13] !log cleanup 2620:0:860::/46 and 208.80.152.0/22 aggregates from cr2-eqord - T246721 [09:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:18] T246721: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 [09:48:06] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Kormat) [09:48:09] !log cleanup 2620:0:860::/46 and 208.80.152.0/22 aggregates from cr2-eqdfw - T246721 [09:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:15] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Marostegui) [09:49:23] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Marostegui) Access to tendril confirmed [09:49:32] !log re-order aggregate 
routes to standardize order [09:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:39] (CR) Filippo Giunchedi: [C: +2] logstash: validate config files [puppet] - https://gerrit.wikimedia.org/r/587704 (https://phabricator.wikimedia.org/T221052) (owner: Filippo Giunchedi) [09:49:53] (PS1) Ema: Release version 0.3 [software/purged] - https://gerrit.wikimedia.org/r/588658 (https://phabricator.wikimedia.org/T249583) [09:51:49] (CR) Ema: [V: +2 C: +2] Release version 0.3 [software/purged] - https://gerrit.wikimedia.org/r/588658 (https://phabricator.wikimedia.org/T249583) (owner: Ema) [09:52:58] Operations, DC-Ops, SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (Dzahn) a:Dzahn→None [09:57:23] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Marostegui) Invitation sent to ops and ops-private [09:57:35] Operations, Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (Marostegui) [09:57:45] (PS1) Ema: cache: test purged on cp3050 [puppet] - https://gerrit.wikimedia.org/r/588660 (https://phabricator.wikimedia.org/T249583) [09:59:58] (CR) Jbond: [C: +2] profile::mail::jumpcloud: add new class to manage jumpcloud aliases [puppet] - https://gerrit.wikimedia.org/r/585501 (https://phabricator.wikimedia.org/T244792) (owner: Jbond) [10:00:21] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/588420 (https://phabricator.wikimedia.org/T244792) (owner: Jbond) [10:00:31] Operations, Wikimedia-Logstash, observability, Patch-For-Review: config file change canarying for logstash - https://phabricator.wikimedia.org/T221052 (fgiunchedi) Puppet will ask Logstash to validate individual config files before installing them via `validate_cmd`. The puppet run will fail if e... 
[10:07:39] 10Operations, 10vm-requests: eqiad/codfw: 1 each VM request for people.wikimedia.org - https://phabricator.wikimedia.org/T249907 (10Dzahn) [10:07:46] (03CR) 10Filippo Giunchedi: [C: 03+1] Logstash Junos PFE Firewall parsing, add PFE_FW_SYSLOG_ETH_IP6_TCP_UDP [puppet] - 10https://gerrit.wikimedia.org/r/587786 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [10:08:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=purged site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:08:33] (03CR) 10Ayounsi: [C: 03+2] Logstash Junos PFE Firewall parsing, add PFE_FW_SYSLOG_ETH_IP6_TCP_UDP [puppet] - 10https://gerrit.wikimedia.org/r/587786 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [10:15:18] (03PS1) 10Kormat: icinga: grant liberal permissions to kormat [puppet] - 10https://gerrit.wikimedia.org/r/588661 (https://phabricator.wikimedia.org/T250134) [10:15:47] !log update prefix-list LVS-service-ips to add missing prefixes [10:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:45] (03CR) 10Filippo Giunchedi: logstash: log safepoints only when running the daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587705 (https://phabricator.wikimedia.org/T221052) (owner: 10Filippo Giunchedi) [10:17:47] (03CR) 10Marostegui: [C: 03+1] icinga: grant liberal permissions to kormat [puppet] - 10https://gerrit.wikimedia.org/r/588661 (https://phabricator.wikimedia.org/T250134) (owner: 10Kormat) [10:18:35] (03CR) 10Kormat: [C: 03+2] icinga: grant liberal permissions to kormat [puppet] - 10https://gerrit.wikimedia.org/r/588661 (https://phabricator.wikimedia.org/T250134) (owner: 10Kormat) [10:25:08] (03CR) 10Jbond: [C: 03+2] profile::mail::mx: Add toggle to enable jumpcloud integration [puppet] - 10https://gerrit.wikimedia.org/r/588420 (https://phabricator.wikimedia.org/T244792) 
(owner: 10Jbond) [10:25:18] (03CR) 10Jbond: [C: 03+2] profile::mail::mx: add type enforcment, lookups and move defaults [puppet] - 10https://gerrit.wikimedia.org/r/588419 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [10:26:28] 10Operations, 10Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (10Kormat) [10:27:45] 10Operations, 10Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (10Kormat) [10:34:54] (03PS4) 10Hnowlan: profile::kubernetes: add the puppet CA cert to general.yaml [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) [10:36:15] 10Operations, 10Patch-For-Review: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (10Marostegui) [10:40:39] 10Operations, 10observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (10fgiunchedi) Thanks @elukey for looking into this! Yes there's elasticsearch curator to clean up old indices with 90d retention (e.g. on logstash1008). It seems that with daily growt... [10:49:47] (03PS1) 10Ayounsi: Add missing LVS prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/588666 [10:51:41] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Move netflow to TLS encryption/authentication via librdkafka - https://phabricator.wikimedia.org/T248980 (10elukey) [10:52:52] (03PS2) 10Ayounsi: Add missing LVS prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/588666 [10:53:14] (03CR) 10Ayounsi: [C: 03+2] Add missing LVS prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/588666 (owner: 10Ayounsi) [10:54:30] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1023 crashed / Smart Storage Battery failure - https://phabricator.wikimedia.org/T249174 (10fgiunchedi) >>! In T249174#6044354, @Jclark-ctr wrote: > @fgiunchedi @Volans i will be on site 4/14/2020. at 10am Est we have limited time on site can we schedu... 
[10:55:38] !log set uRPF log action to syslog infra wide - T244147 [10:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:51] (03PS1) 10Jcrespo: bacula: Increase max total size of Databases backups to 40 TB [puppet] - 10https://gerrit.wikimedia.org/r/588668 (https://phabricator.wikimedia.org/T138562) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200414T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:22] * Urbanecm steals the window, will add to the calendar [11:01:06] o/ [11:01:13] ok [11:01:21] (03PS1) 10Urbanecm: Revert "Enable cswiki anniversary logo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588669 (https://phabricator.wikimedia.org/T249173) [11:01:45] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588669 (https://phabricator.wikimedia.org/T249173) (owner: 10Urbanecm) [11:01:57] !log resizing backup1001:/srv/databases to 40 TB [11:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:49] (03Merged) 10jenkins-bot: Revert "Enable cswiki anniversary logo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588669 (https://phabricator.wikimedia.org/T249173) (owner: 10Urbanecm) [11:04:54] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: 7da408e: Revert "Enable cswiki anniversary logo" (T249173) (duration: 01m 00s) [11:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:00] T249173: Czech Wikipedia 450k milestone special logo - https://phabricator.wikimedia.org/T249173 [11:05:46] !log Purge https://en.wikipedia.org/static/images/project-logos/cswiki*.png (T249173) [11:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:39] 
Lucas_WMDE: done, if you want to use the window for anything [11:07:47] no, nothing from me [11:07:55] I was just responding to jouncebot’s ping ^^ [11:08:02] feel free to close the SWAT [11:08:20] ah, I see [11:08:22] !log EU SWAT done [11:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:39] (03CR) 10Volans: [C: 03+2] netbox: increase changelog retention to 2 years [puppet] - 10https://gerrit.wikimedia.org/r/585463 (owner: 10Volans) [11:17:38] (03PS3) 10Jbond: role::mail::mx: enable jumpcloud test domain [puppet] - 10https://gerrit.wikimedia.org/r/588425 (https://phabricator.wikimedia.org/T244792) [11:22:06] (03CR) 10jerkins-bot: [V: 04-1] role::mail::mx: enable jumpcloud test domain [puppet] - 10https://gerrit.wikimedia.org/r/588425 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [11:27:46] (03PS4) 10Jbond: role::mail::mx: enable jumpcloud test domain [puppet] - 10https://gerrit.wikimedia.org/r/588425 (https://phabricator.wikimedia.org/T244792) [11:39:22] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/588425 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [11:46:22] (03PS5) 10Hnowlan: profile::kubernetes: add the puppet CA cert to general.yaml [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) [11:50:00] (03PS6) 10Hnowlan: profile::kubernetes: add the puppet CA cert to general.yaml [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) [11:58:25] (03PS7) 10Hnowlan: profile::kubernetes: add the puppet CA cert to general.yaml [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) [11:59:06] (03PS1) 10Ayounsi: uRPF: log to syslog [homer/public] - 10https://gerrit.wikimedia.org/r/588676 (https://phabricator.wikimedia.org/T244147) [12:01:45] (03CR) 10Ayounsi: [C: 03+2] uRPF: log to syslog [homer/public] - 10https://gerrit.wikimedia.org/r/588676 
(https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [12:03:34] !log upgrade haproxy on dns servers [12:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:48] (03CR) 10Hnowlan: "> COOL! Does this mean we can access this as .Values.puppet_ca_crt?" [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [12:14:40] (03CR) 10Marostegui: [C: 03+1] bacula: Increase max total size of Databases backups to 40 TB [puppet] - 10https://gerrit.wikimedia.org/r/588668 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:16:40] (03PS8) 10Vgutierrez: Release 8.0.7-rc0-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) [12:17:32] (03PS3) 10Hashar: protocol.compat: disable a couple of pylint errors [software/keyholder] - 10https://gerrit.wikimedia.org/r/485705 (owner: 10Faidon Liambotis) [12:17:33] (03PS3) 10Hashar: Bump minimum Python to 3.5; also test with 3.7 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485706 (owner: 10Faidon Liambotis) [12:18:15] (03CR) 10jerkins-bot: [V: 04-1] protocol.compat: disable a couple of pylint errors [software/keyholder] - 10https://gerrit.wikimedia.org/r/485705 (owner: 10Faidon Liambotis) [12:18:26] (03CR) 10Jbond: "> Patch Set 7:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [12:19:04] hashar: please code review, but not merge yet [12:19:16] I'll have a look at all this next week [12:19:52] paravoid: though those two first ones are really trivial :] There is a trouble with PyYAML which is its own little madness [12:20:27] paravoid: and yeah I will hold on them. 
I just happened to notice that that version supports reloading the YAML configuration file, which saves the trouble of having to restart and rearm the keyholder instances in production [12:21:06] jynus: can we confirm install1003/2003 and apt1001/2001 are in bacula properly? we want to make sure before shutting down install1002/2002. i can look in bconsole myself but is there more? [12:21:18] (03PS1) 10Vgutierrez: ATS: Enable inbound TLSv1.3 in text@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/588678 (https://phabricator.wikimedia.org/T170567) [12:22:22] paravoid: i will be happy to review them as needed. Then it is probably not of the utmost priority ;] [12:22:28] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.7-rc0-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [12:22:55] 10Operations, 10DBA, 10Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) [12:24:48] 10Operations, 10DBA, 10Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) 05Open→03Resolved dbproxy1019 has been working fine for a week. Considering this done.
Next step is to decommission dbproxy1011 : T249590 [12:25:44] (03PS9) 10Vgutierrez: Release 8.0.7-rc0-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) [12:25:46] (03PS1) 10Dzahn: cloud/devtools: add missing gerrit::server::backups_enabled key [puppet] - 10https://gerrit.wikimedia.org/r/588679 [12:26:11] (03CR) 10jerkins-bot: [V: 04-1] cloud/devtools: add missing gerrit::server::backups_enabled key [puppet] - 10https://gerrit.wikimedia.org/r/588679 (owner: 10Dzahn) [12:26:37] (03PS2) 10Dzahn: cloud/devtools: add missing gerrit::server::backups_enabled key [puppet] - 10https://gerrit.wikimedia.org/r/588679 [12:26:44] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/21905/" [puppet] - 10https://gerrit.wikimedia.org/r/588678 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [12:27:22] (03PS4) 10Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [12:27:48] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: add missing gerrit::server::backups_enabled key [puppet] - 10https://gerrit.wikimedia.org/r/588679 (owner: 10Dzahn) [12:28:06] (03PS5) 10Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [12:28:46] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [12:31:26] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.7-rc0-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [12:34:53] (03CR) 10Jcrespo: [C: 03+1] install_server: Allow reimage of labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/587628 (https://phabricator.wikimedia.org/T249188) (owner: 10Marostegui) 
[12:36:25] mutante: let me check [12:38:13] (03PS1) 10Dzahn: cloud/devtools: add missing gerrit::server::backup_set Hiera key [puppet] - 10https://gerrit.wikimedia.org/r/588682 [12:38:47] (03PS1) 10Ema: Refactor frontendWorker, add worker test [software/purged] - 10https://gerrit.wikimedia.org/r/588683 (https://phabricator.wikimedia.org/T249583) [12:38:59] (03CR) 10Ema: [C: 03+1] ATS: Enable inbound TLSv1.3 in text@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/588678 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [12:39:02] jouncebot: now [12:39:02] No deployments scheduled for the next 3 hour(s) and 20 minute(s) [12:41:02] jynus: thanks! a big one is the GPG keys of root in /root [12:41:17] let me add a backup::set if needed.. looking as well [12:41:18] See -sre for less noise [12:43:15] (03PS10) 10Vgutierrez: Release 8.0.7-rc0-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) [12:44:08] (03PS6) 10Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [12:44:23] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [12:45:35] (03CR) 10Vgutierrez: Refactor frontendWorker, add worker test (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/588683 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [12:46:36] (03PS2) 10Ema: 0.4: refactor frontendWorker, add worker test [software/purged] - 10https://gerrit.wikimedia.org/r/588683 (https://phabricator.wikimedia.org/T249583) [12:47:19] (03CR) 10jerkins-bot: [V: 04-1] 0.4: refactor frontendWorker, add worker test [software/purged] - 10https://gerrit.wikimedia.org/r/588683 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [12:47:58] 10Operations, 10Research: recommendation api's test on scb nodes are 
flapping - https://phabricator.wikimedia.org/T247732 (10elukey) >>! In T247732#6014449, @bmansurov wrote: > @elukey how can I access `http://scb1001.eqiad.wmnet:9632`? Should I be on some host to ping that URL? Also, where can I see the logs?... [12:49:16] (03CR) 10Vgutierrez: [C: 03+2] ATS: Enable inbound TLSv1.3 in text@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/588678 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [12:49:27] (03PS3) 10Ema: 0.4: refactor frontendWorker, add worker test [software/purged] - 10https://gerrit.wikimedia.org/r/588683 (https://phabricator.wikimedia.org/T249583) [12:50:39] !log Enable inbound TLSv1.3 in text@eqsin - T170567 [12:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:45] T170567: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 [12:53:22] (03CR) 10Ayounsi: "> @Ayounsi can you give me an example of two boxes with the same VIP? I want to test for really reals on them." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [12:55:38] (03PS4) 10Ema: 0.4: refactor frontendWorker, add worker test [software/purged] - 10https://gerrit.wikimedia.org/r/588683 (https://phabricator.wikimedia.org/T249583) [12:56:36] (03CR) 10Filippo Giunchedi: [C: 03+1] kafkatee::instance: add types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/588086 (owner: 10Elukey) [13:01:38] (03CR) 10Jbond: "thanks for adding the types big improvement added some nits to [hopefully] tighten this up" (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/588086 (owner: 10Elukey) [13:03:16] jbond42: lol your "some nits" seems exactly what I'd expect from a volans' code review! 
:D: D :D [13:03:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM now, PCC https://puppet-compiler.wmflabs.org/compiler1001/21909/" [puppet] - 10https://gerrit.wikimedia.org/r/588015 (owner: 10Elukey) [13:03:35] jbond42: jokes aside, will follow up, thanks a lot! [13:03:36] lol [13:05:27] (03CR) 10Vgutierrez: [C: 03+1] 0.4: refactor frontendWorker, add worker test [software/purged] - 10https://gerrit.wikimedia.org/r/588683 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [13:05:50] (03CR) 10Jbond: "missed inputs param" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/588086 (owner: 10Elukey) [13:05:55] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: add missing gerrit::server::backup_set Hiera key [puppet] - 10https://gerrit.wikimedia.org/r/588682 (owner: 10Dzahn) [13:06:08] (03CR) 10Ema: [C: 03+2] vcl: remove n-hit-wonder admission policy [puppet] - 10https://gerrit.wikimedia.org/r/588650 (https://phabricator.wikimedia.org/T249809) (owner: 10Ema) [13:06:35] mutante: hi :) [13:06:47] mutante: can I puppet-merge your change along with mine? [13:06:48] (03CR) 10Filippo Giunchedi: [C: 03+1] smart: make smart_data_dump importable for adding tests [puppet] - 10https://gerrit.wikimedia.org/r/587795 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [13:06:56] ema: was about to ask the same, please do [13:06:59] ack [13:09:12] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:32] (03PS5) 10Jbond: role::mail::mx: enable jumpcloud test domain [puppet] - 10https://gerrit.wikimedia.org/r/588425 (https://phabricator.wikimedia.org/T244792) [13:10:17] (03CR) 10Ema: [C: 03+2] 0.4: refactor frontendWorker, add worker test [software/purged] - 10https://gerrit.wikimedia.org/r/588683 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [13:11:46] (03PS1) 10Hashar: zuul: create /var/log/zuul [puppet] - 10https://gerrit.wikimedia.org/r/588687 (https://phabricator.wikimedia.org/T224591) [13:11:53] ACKNOWLEDGEMENT - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn migration ongoing https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:23] ^ most probably me [13:14:47] (03CR) 10Dzahn: "this is not strictly as it is on contint1001. it is zuul:adm there. this would change group ownership on the prod host. but if that's ok " [puppet] - 10https://gerrit.wikimedia.org/r/588687 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [13:14:53] (03CR) 10Ottomata: "@jbond we can take care of using this for event* helmfiles after this lands. https://phabricator.wikimedia.org/T250146" [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [13:15:51] (03PS1) 10Ema: purged: run as nobody/nogroup [puppet] - 10https://gerrit.wikimedia.org/r/588689 (https://phabricator.wikimedia.org/T249583) [13:15:56] mutante: oh. I can make it adm owned, I missed that [13:15:58] (03CR) 10Filippo Giunchedi: [C: 04-1] "Thanks for this! 
See inline, we should also make sure tests are ran by CI" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [13:16:09] hashar: alright [13:16:58] (03CR) 10Jbond: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [13:17:06] (03PS2) 10Hashar: zuul: create /var/log/zuul [puppet] - 10https://gerrit.wikimedia.org/r/588687 (https://phabricator.wikimedia.org/T224591) [13:18:33] (03CR) 10Dzahn: [C: 03+2] zuul: create /var/log/zuul [puppet] - 10https://gerrit.wikimedia.org/r/588687 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [13:19:51] hashar: done! try again now [13:21:15] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, see inline for a nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587816 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [13:21:36] !log Starting zuul-merger on contint2001 [13:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:04] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:13] hashar: want me to unmask it? [13:22:34] I wanna verify it works first ;] [13:22:46] I tried it locally, but who knows what might happen [13:22:50] is that possible while it's masked? [13:23:02] Apr 14 13:21:57 contint2001 systemd[1]: Started Zuul merger. [13:23:07] apparently yes [13:23:10] Active: inactive (dead) [13:23:29] ooh. ignore me. 
i made a typo [13:23:41] zuul-merge vs zuul-merger [13:24:56] ah host key verification failed fun [13:25:02] I am pretty sure I got that one solved ages ago [13:26:10] (03PS2) 10Ema: purged: run as unprivileged user instead of root [puppet] - 10https://gerrit.wikimedia.org/r/588689 (https://phabricator.wikimedia.org/T249583) [13:28:02] (03PS1) 10Giuseppe Lavagetto: services_proxy: temp mitigation for intermittent parsoid requests failures [puppet] - 10https://gerrit.wikimedia.org/r/588692 (https://phabricator.wikimedia.org/T249705) [13:29:31] (03CR) 10Muehlenhoff: purged: run as unprivileged user instead of root (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/588689 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [13:29:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services_proxy: temp mitigation for intermittent parsoid requests failures [puppet] - 10https://gerrit.wikimedia.org/r/588692 (https://phabricator.wikimedia.org/T249705) (owner: 10Giuseppe Lavagetto) [13:30:22] (03PS8) 10Hnowlan: profile::kubernetes: add the puppet CA cert to general.yaml [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) [13:30:49] (03PS3) 10Ema: purged: run as unprivileged user instead of root [puppet] - 10https://gerrit.wikimedia.org/r/588689 (https://phabricator.wikimedia.org/T249583) [13:31:07] (03CR) 10Elukey: kafkatee::instance: add types to parameters (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/588086 (owner: 10Elukey) [13:31:44] (03CR) 10jerkins-bot: [V: 04-1] purged: run as unprivileged user instead of root [puppet] - 10https://gerrit.wikimedia.org/r/588689 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [13:36:08] (03PS1) 10Jbond: pcc: update pcc to support host variable overrides [puppet] - 10https://gerrit.wikimedia.org/r/588694 (https://phabricator.wikimedia.org/T250168) [13:36:39] (03PS4) 10Ema: purged: run with DynamicUser [puppet] - 10https://gerrit.wikimedia.org/r/588689 
(https://phabricator.wikimedia.org/T249583) [13:37:05] (03CR) 10Ema: purged: run with DynamicUser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/588689 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [13:37:36] <_joe_> jbond42: oh thanks, I planned to fix it today <3 [13:38:25] np, forced me to finally configure my jenkins token :) [13:38:30] (03PS4) 10Elukey: kafkatee::instance: add types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/588086 [13:38:32] (03PS5) 10Elukey: Enable TLS encryption between kafkatee instances and Kafka [puppet] - 10https://gerrit.wikimedia.org/r/588015 [13:39:39] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/588694 (https://phabricator.wikimedia.org/T250168) (owner: 10Jbond) [13:40:14] (03CR) 10Jbond: [C: 03+2] pcc: update pcc to support host variable overrides [puppet] - 10https://gerrit.wikimedia.org/r/588694 (https://phabricator.wikimedia.org/T250168) (owner: 10Jbond) [13:42:14] PROBLEM - Echostore codfw on echostore.svc.codfw.wmnet is CRITICAL: /echoseen/v1/{key} (Store value for key) is CRITICAL: Test Store value for key returned the unexpected status 500 (expecting: 201) https://www.mediawiki.org/wiki/Kask [13:42:21] (03CR) 10Filippo Giunchedi: pcc: update pcc to support host variable overrides (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/588694 (https://phabricator.wikimedia.org/T250168) (owner: 10Jbond) [13:42:59] 10Operations, 10ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (10Eevans) >>! In T250050#6051028, @elukey wrote: > @Eevans this is the weekend of broken cassandra hosts, adding you as FYI :) Thanks :) And thank you for taking a look over the weekend, it is much appreciated!
[13:43:45] (03PS5) 10Elukey: kafkatee::instance: add types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/588086 [13:43:47] (03PS6) 10Elukey: Enable TLS encryption between kafkatee instances and Kafka [puppet] - 10https://gerrit.wikimedia.org/r/588015 [13:47:43] (03PS6) 10Elukey: kafkatee::instance: add types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/588086 [13:47:45] (03PS7) 10Elukey: Enable TLS encryption between kafkatee instances and Kafka [puppet] - 10https://gerrit.wikimedia.org/r/588015 [13:48:44] (03PS1) 10Ema: purged: run one frontend and multiple backend workers [puppet] - 10https://gerrit.wikimedia.org/r/588698 (https://phabricator.wikimedia.org/T249583) [13:50:33] (03PS1) 10Dzahn: gerrit: remove unused parameter for cache_text_nodes [puppet] - 10https://gerrit.wikimedia.org/r/588699 [13:51:22] (03CR) 10Elukey: "All good pcc for next patch works fine!" [puppet] - 10https://gerrit.wikimedia.org/r/588086 (owner: 10Elukey) [13:51:38] (03CR) 10Vgutierrez: "This change is ready for review." [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [13:52:17] (03CR) 10Elukey: "Jbond: let me know if it is enough or not, will merge if good!" [puppet] - 10https://gerrit.wikimedia.org/r/588086 (owner: 10Elukey) [13:52:24] (03CR) 10Jbond: "I just so happend to be looking at pcc" (0314 comments) [puppet] - 10https://gerrit.wikimedia.org/r/588086 (owner: 10Elukey) [13:52:44] (03CR) 10Krinkle: [C: 04-1] "Indeed. Seems like we're not using rebounds the way I thought." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586390 (https://phabricator.wikimedia.org/T249325) (owner: 10CDanis) [13:52:47] 10Operations, 10ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (10Eevans) OK, so it seems like we have a failed SSD (`/dev/sdc`), and as a result, some degraded arrays. 
Ideally we'd be able to replace the SSD and rebuild the array, but we are using the `/dev/sd[x]4` parti... [13:52:57] (03CR) 10Muehlenhoff: [C: 03+1] "Beautiful" [puppet] - 10https://gerrit.wikimedia.org/r/588689 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [13:53:28] (03Abandoned) 10CDanis: reverse-proxy: Disable rebound purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/586390 (https://phabricator.wikimedia.org/T249325) (owner: 10CDanis) [13:53:32] 10Operations: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (10Kormat) [13:54:43] (03CR) 10jerkins-bot: [V: 04-1] gerrit: remove unused parameter for cache_text_nodes [puppet] - 10https://gerrit.wikimedia.org/r/588699 (owner: 10Dzahn) [13:55:15] !log upload purged 0.4 to buster-wikimedia T249583 [13:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:23] T249583: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 [13:56:26] (03CR) 10Elukey: "> (14 comments)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/588086 (owner: 10Elukey) [13:57:43] (03CR) 10Ema: [C: 03+1] Release 8.0.7-rc0-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [13:58:07] (03CR) 10Jbond: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/588086 (owner: 10Elukey) [13:58:18] jbond42: does Pattern[/\d+(-\d+)*/] looks better? [13:58:43] elukey: i think ? 
instead of * [13:58:56] Pattern[/\d+(-\d+)?/] [13:59:04] (03PS1) 10Gergő Tisza: Deploy Welcome Survey to Serbian Wikipedia and French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588701 (https://phabricator.wikimedia.org/T249956) [13:59:05] ah sure either 1 or none, better than * [13:59:17] yes [14:00:00] (03PS7) 10Elukey: kafkatee::instance: add types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/588086 [14:00:02] (03PS8) 10Elukey: Enable TLS encryption between kafkatee instances and Kafka [puppet] - 10https://gerrit.wikimedia.org/r/588015 [14:00:08] 10Operations, 10ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (10Papaul) a:03Papaul [14:00:22] (03CR) 10Ema: [C: 03+2] purged: run with DynamicUser [puppet] - 10https://gerrit.wikimedia.org/r/588689 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [14:00:57] (03CR) 10Ema: [C: 03+2] purged: run one frontend and multiple backend workers [puppet] - 10https://gerrit.wikimedia.org/r/588698 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [14:01:06] (03PS2) 10Ema: purged: run one frontend and multiple backend workers [puppet] - 10https://gerrit.wikimedia.org/r/588698 (https://phabricator.wikimedia.org/T249583) [14:01:46] (03PS1) 10Jbond: cli: add pcc invocation to logging output [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/588703 (https://phabricator.wikimedia.org/T250169) [14:02:02] jbond42: mmm [14:02:03] - ssl_ca_location => /etc/ssl/certs/Puppet_Internal_CA.pem [14:02:03] + ssl_ca_location => /var/lib/puppet/ssl/certs/ca.pem [14:02:10] is it the same? [14:02:34] RECOVERY - Echostore codfw on echostore.svc.codfw.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask [14:03:26] they are the same file. 
we could probably bike shed this and can wait for a different change but /etc/ssl/certs/Puppet_Internal_CA.pem -> /usr/local/share/ca-certificates/Puppet_Internal_CA.crt [14:03:40] and /usr/local/share/ca-certificates/Puppet_Internal_CA.crt is copied from /var/lib/puppet/ssl/certs/ca.pem via puppet [14:03:47] so i think /var/lib/puppet/ssl/certs/ca.pem makes the most sense [14:03:53] I completely trust you, just wanted to double check [14:03:58] (03PS6) 10Gehel: [mwgrep] only query live indices [puppet] - 10https://gerrit.wikimedia.org/r/587586 (https://phabricator.wikimedia.org/T249435) (owner: 10DCausse) [14:04:00] then I think we are ready [14:05:01] (03CR) 10Jbond: [C: 03+1] "lgtm thank you :)" [puppet] - 10https://gerrit.wikimedia.org/r/588086 (owner: 10Elukey) [14:05:23] thanks elukey <3 [14:07:01] (03CR) 10Ema: [C: 03+2] cache: test purged on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/588660 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [14:09:36] (03CR) 10Gehel: [C: 03+2] [mwgrep] only query live indices [puppet] - 10https://gerrit.wikimedia.org/r/587586 (https://phabricator.wikimedia.org/T249435) (owner: 10DCausse) [14:11:16] (03PS1) 10Hashar: beta: update text cache IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588706 (https://phabricator.wikimedia.org/T250085) [14:12:26] !log cp3050: resume purged testing T249583 [14:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:31] jbond42: thank you for the patience! 
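Note on the Pattern exchange at 13:58 above (`\d+(-\d+)*` versus `\d+(-\d+)?`): the sketch below is a rough illustration in Python, not the Puppet code from the change. The names `STAR`, `OPTIONAL` and `accepts` are made up for this note. It also assumes, as far as I recall, that a Puppet `Pattern[...]` type matches anywhere in the string unless the regex is anchored, so the `?` form only rejects longer chains such as `0-3-7` once anchors are added.

```python
import re

# The two candidate expressions from the chat (names are hypothetical).
STAR = r"\d+(-\d+)*"      # a number followed by any count of "-number" groups
OPTIONAL = r"\d+(-\d+)?"  # a number followed by at most one "-number" group


def accepts(pattern: str, value: str, anchored: bool = True) -> bool:
    """Return True if value is accepted by pattern.

    anchored=True requires the whole string to match (like \\A...\\z);
    anchored=False accepts a match anywhere in the string.
    """
    if anchored:
        return re.fullmatch(pattern, value) is not None
    return re.search(pattern, value) is not None


if __name__ == "__main__":
    for sample in ["3", "0-3", "0-3-7", "abc1"]:
        print(
            sample,
            "star:", accepts(STAR, sample),
            "optional:", accepts(OPTIONAL, sample),
            "optional, unanchored:", accepts(OPTIONAL, sample, anchored=False),
        )
```

If the assumption about unanchored matching holds, the two forms would behave the same in a Pattern type without anchors, since both accept any string containing a digit.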
[14:12:32] T249583: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 [14:12:52] (03CR) 10Elukey: [C: 03+2] kafkatee::instance: add types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/588086 (owner: 10Elukey) [14:13:05] (03CR) 10Elukey: [C: 03+2] Enable TLS encryption between kafkatee instances and Kafka [puppet] - 10https://gerrit.wikimedia.org/r/588015 (owner: 10Elukey) [14:13:08] (03PS11) 10Vgutierrez: Release 8.0.7-rc0-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) [14:13:17] (03CR) 10RhinosF1: [C: 03+1] "Seems so simple now we know!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588706 (https://phabricator.wikimedia.org/T250085) (owner: 10Hashar) [14:13:25] (03CR) 10Hashar: [C: 03+2] "Labs only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588706 (https://phabricator.wikimedia.org/T250085) (owner: 10Hashar) [14:13:27] (03PS1) 10Hashar: zuul: add missing ssh public key for zuul-merger [puppet] - 10https://gerrit.wikimedia.org/r/588707 (https://phabricator.wikimedia.org/T224591) [14:13:29] (03PS1) 10Hashar: Profile to inject Gerrit ssh public key to known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/588708 (https://phabricator.wikimedia.org/T224591) [14:13:35] elukey: alwyas happy to help :) [14:13:58] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/588707 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [14:14:01] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/588708 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [14:15:06] !log Rebasing mediawiki-config on deploy1001 for a deployment-prep config change ( https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/588706/ ) [14:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:28] (03Merged) 10jenkins-bot: beta: update text cache IP [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/588706 (https://phabricator.wikimedia.org/T250085) (owner: 10Hashar) [14:15:31] (03CR) 10jerkins-bot: [V: 04-1] zuul: add missing ssh public key for zuul-merger [puppet] - 10https://gerrit.wikimedia.org/r/588707 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [14:15:38] !log enable TLS between weblog1001,mwlog2001.codfw.wmnet,mwlog1001 and Kafka Jumbo/Logging - T250147 [14:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:44] T250147: Add TLS encryption support to Kafkatee and enable it where possible - https://phabricator.wikimedia.org/T250147 [14:16:08] Cc: herron, godog, shdubsh --^ [14:18:04] RECOVERY - snapshot of s7 in eqiad on db1115 is OK: snapshot for s7 at eqiad taken less than 3 days ago and larger than 90 GB: Last one 2020-04-14 11:13:00 from db1116.eqiad.wmnet:3317 (944 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [14:19:01] elukey: sweet, thank you! [14:20:12] (03PS1) 10Dzahn: cloud/devtools: add missing cache_nodes key [puppet] - 10https://gerrit.wikimedia.org/r/588710 [14:20:25] (03CR) 10Vgutierrez: [C: 03+1] Add 0036-VSV00004.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/587783 (https://phabricator.wikimedia.org/T249810) (owner: 10Ema) [14:20:39] godog: just to triple check, can you verify that mwlog* are ok? Not sure how to verify that nothing is exploding besides the kafkatee logs [14:22:05] 10Operations, 10ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (10Papaul) @Eevans the IDRAC is not showing any failed drive. Is it possible for you to get me some system logs showing the bad disk so i can upload that when i ask for a disk replacement. The last log i have... [14:23:43] elukey: ack, I'll check too [14:24:49] elukey: ah yeah nothing is flowing on the topics from mwlog, so no activity is expected. I'm assuming if weblog kafkatee works then we're good [14:25:34] yep it is, good! 
[14:32:03] (03PS2) 10Cwhite: smart: make smart_data_dump importable for adding tests [puppet] - 10https://gerrit.wikimedia.org/r/587795 (https://phabricator.wikimedia.org/T199236) [14:32:05] (03PS3) 10Cwhite: smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) [14:32:07] (03PS4) 10Cwhite: smart: abstract parsing from data gathering and add tests [puppet] - 10https://gerrit.wikimedia.org/r/587816 (https://phabricator.wikimedia.org/T199236) [14:33:03] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [14:33:03] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:09] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [14:33:09] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:12] !log power down ms-be1023 - T249174 [14:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:18] T249174: ms-be1023 crashed / Smart Storage Battery failure - https://phabricator.wikimedia.org/T249174 [14:34:54] RECOVERY - snapshot of s3 in eqiad on db1115 is OK: snapshot for s3 at eqiad taken less than 3 days ago and larger than 90 GB: Last one 2020-04-14 11:05:12 from db1095.eqiad.wmnet:3313 (851 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [14:38:14] (03CR) 10Filippo Giunchedi: [C: 03+1] "Change itself LGTM, I think having this information in the HTML output would be useful as well" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/588703 (https://phabricator.wikimedia.org/T250169) (owner: 10Jbond) [14:40:45] 
10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1023 crashed / Smart Storage Battery failure - https://phabricator.wikimedia.org/T249174 (10Jclark-ctr) Replaced failed bbu [14:41:38] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1023 crashed / Smart Storage Battery failure - https://phabricator.wikimedia.org/T249174 (10Jclark-ctr) [14:42:08] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@354ae2d]: Remove rules enabled in k8s T248677 [14:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:13] T248677: Finalise changeprop migration to k8s - https://phabricator.wikimedia.org/T248677 [14:44:06] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@354ae2d]: Remove rules enabled in k8s T248677 (duration: 01m 58s) [14:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:55] hashar: re: jenkins/gerrit ssh key. no, the private repo only has the private file, not the public one. I am tempted to argue a public file does not have to be in the private repo, but whatever, the other side is then they are together in one place. 
I will re-create public from private and add it and compare to contint1001 file that is not puppetized [14:46:50] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:47:46] (03PS4) 10Cwhite: smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) [14:48:39] (03PS1) 10Hashar: secret: placeholder for jenkins-bot_gerrit_id_rsa.pub [labs/private] - 10https://gerrit.wikimedia.org/r/588713 [14:48:51] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1023 crashed / Smart Storage Battery failure - https://phabricator.wikimedia.org/T249174 (10fgiunchedi) 05Open→03Resolved Looks like we're back, thanks @Jclark-ctr ` Controller Status: OK Hardware Revision: B Firmware Version: 6.88 Rebuil... [14:49:10] (03CR) 10Dzahn: "The private repo has the private key but not the public key. 
I generated the public key from the private key and compared the result to th" [puppet] - 10https://gerrit.wikimedia.org/r/588707 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [14:49:21] mutante: yeah I am fixing it :) [14:49:41] hashar: i added the public key in the private repo just now [14:49:47] oh [14:50:06] it matches the file from contint1001 [14:50:11] https://gerrit.wikimedia.org/r/#/c/labs/private/+/588713/ [14:50:17] for the public repo :] [14:50:34] I am updating that repo on the compiler and will recompile [14:50:45] ack, jenkins-bot_gerrit_id_rsa.pub: OpenSSH RSA public key [14:50:59] (03CR) 10Dzahn: [C: 03+1] secret: placeholder for jenkins-bot_gerrit_id_rsa.pub [labs/private] - 10https://gerrit.wikimedia.org/r/588713 (owner: 10Hashar) [14:51:59] (03CR) 10Dzahn: [V: 03+2 C: 03+2] secret: placeholder for jenkins-bot_gerrit_id_rsa.pub [labs/private] - 10https://gerrit.wikimedia.org/r/588713 (owner: 10Hashar) [14:52:03] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.7-rc0-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/588399 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [14:52:30] (03PS5) 10Cwhite: smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) [14:52:32] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: add missing cache_nodes key [puppet] - 10https://gerrit.wikimedia.org/r/588710 (owner: 10Dzahn) [14:53:05] (03CR) 10Dzahn: [C: 04-1] "cache_nodes parameter in gerrit is not used. it was added for avatars support which has not happened." 
[puppet] - 10https://gerrit.wikimedia.org/r/588710 (owner: 10Dzahn) [14:53:30] (03CR) 10Cwhite: smart: add _check_output wrapper method and tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [14:54:00] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: add missing cache_nodes key [puppet] - 10https://gerrit.wikimedia.org/r/588710 (owner: 10Dzahn) [14:54:11] (03PS2) 10Dzahn: cloud/devtools: add missing cache_nodes key [puppet] - 10https://gerrit.wikimedia.org/r/588710 [14:54:16] (03PS2) 10Hashar: zuul: add missing ssh public key for zuul-merger [puppet] - 10https://gerrit.wikimedia.org/r/588707 (https://phabricator.wikimedia.org/T224591) [14:54:30] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/588707 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [14:55:00] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/588707 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [14:57:02] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1029, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Jclark-ctr) [15:00:33] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Jclark-ctr) [15:00:35] (03CR) 10Hashar: "Thank you for the verification!" 
[puppet] - 10https://gerrit.wikimedia.org/r/588707 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [15:02:06] (03PS6) 10Cwhite: smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) [15:04:41] (03CR) 10Dzahn: [C: 03+2] zuul: add missing ssh public key for zuul-merger [puppet] - 10https://gerrit.wikimedia.org/r/588707 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [15:07:23] 10Operations, 10Analytics, 10Research, 10Traffic: Wikipedia Accessibility, check false positives and false negatives of traffic alarms - https://phabricator.wikimedia.org/T245166 (10Nuria) closing as this is happening as part of our monthly sync up. [15:08:55] 10Operations, 10Analytics, 10Research, 10Traffic: Wikipedia Accessibility, check false positives and false negatives of traffic alarms - https://phabricator.wikimedia.org/T245166 (10Nuria) 05Open→03Resolved [15:08:57] (03PS7) 10Cwhite: smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) [15:09:09] (03PS2) 10Hashar: Profile to inject Gerrit ssh public key to known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/588708 (https://phabricator.wikimedia.org/T224591) [15:09:16] hashar: contint2001: /var/lib/zuul/.ssh/id_rsa.pub created. 
contint1001: /var/lib/zuul/.ssh/id_rsa.pub changed content (because the comment was dropped) [15:09:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] maintenance: Migrate update_special_pages to periodic_job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587334 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [15:09:59] mutante: awesome :) [15:10:01] hashar: if any issues on contint1001 there is also a backup in /root of the old file [15:10:37] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059 (10Nuria) I think db access can be granted to all the requestors, might come handy for the future,... [15:10:53] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/588708 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [15:11:11] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:26] (03PS1) 10Jgreen: add waitpid to reap open3 child processes in FrDeploy.pm [software] - 10https://gerrit.wikimedia.org/r/588717 [15:11:37] mutante: perfect; Then I get a change to attempt to add Gerrit service ssh host key to /etc/ssh/known_hosts . But I have no clue how that sshkey magic works in puppet :) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/588708/ [15:12:07] the good thing is that those changes strike some technical debt [15:12:58] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
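A hedged aside on the earlier note that id_rsa.pub on contint1001 "changed content (because the comment was dropped)": an OpenSSH public key file is a single line of key type, base64 blob, and an optional trailing comment, so two files can differ textually while holding the same key. A small Python sketch of a comparison that ignores the comment; the helper names are made up for illustration, and it assumes plain .pub lines without authorized_keys-style option prefixes.

```python
def key_identity(pubkey_line):
    """Return (key type, base64 blob) from a .pub line, dropping any comment."""
    fields = pubkey_line.split()
    if len(fields) < 2:
        raise ValueError("not an OpenSSH public key line")
    return fields[0], fields[1]


def same_public_key(line_a, line_b):
    """True when both lines carry the same key, whatever their comments say."""
    return key_identity(line_a) == key_identity(line_b)
```

Under that assumption, comparing the backup left in /root with the puppetized file would report a match if only the comment changed.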
[15:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:10] !log upload trafficserver 8.0.7-rc0-1wm2 to apt.wm.o (buster) - T249335 [15:13:13] (03PS1) 10Kormat: Simplify manual ssh host key checking [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/588718 [15:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:16] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [15:14:07] hashar: ah, but i see you are just moving existing code to another place? so i guess that works [15:14:15] (03CR) 10jerkins-bot: [V: 04-1] smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [15:14:40] ah.. labs vs prod [15:15:07] (03CR) 10Giuseppe Lavagetto: [C: 03+1] maintenance: Migrate refreshlinks to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/587331 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [15:15:27] !log update to ats 8.0.7-rc0-1wm2 on cp[4026,4032] - T249335 [15:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:16] (03CR) 10Jgreen: [C: 03+2] add waitpid to reap open3 child processes in FrDeploy.pm [software] - 10https://gerrit.wikimedia.org/r/588717 (owner: 10Jgreen) [15:16:44] (03CR) 10Giuseppe Lavagetto: [C: 03+1] maintenance: Migrate update_flaggedrev_stats to periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/587328 (https://phabricator.wikimedia.org/T211250) (owner: 10RLazarus) [15:17:34] (03CR) 10Hashar: "So myabe that is correct:" [puppet] - 10https://gerrit.wikimedia.org/r/588708 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [15:17:41] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[15:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:22] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update cxserver to 2020-04-13-094138-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/588540 (https://phabricator.wikimedia.org/T239459) (owner: 10KartikMistry) [15:23:08] 10Operations: fix up log retention on log collection/storage hosts - https://phabricator.wikimedia.org/T92839 (10fgiunchedi) I believe we are in a better place nowadays wrt logs retention in the fleet, ok to resolve or lower priority? [15:29:21] 10Operations, 10Commons, 10Thumbor, 10observability: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails - https://phabricator.wikimedia.org/T106937 (10fgiunchedi) Catchpoint has been decommissioned in the meantime, although for this type of check/alert there's certainly the need for sth... [15:33:05] 10Operations, 10Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race on jessie - https://phabricator.wikimedia.org/T148986 (10fgiunchedi) 05Open→03Resolved Boldly resolving since Jessie is deprecated [15:33:07] (03CR) 10CDanis: WIP: add NIC saturation exporter (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/588431 (owner: 10CDanis) [15:33:22] (03PS2) 10CDanis: WIP: add NIC saturation exporter [puppet] - 10https://gerrit.wikimedia.org/r/588431 [15:36:28] (03CR) 10jerkins-bot: [V: 04-1] WIP: add NIC saturation exporter [puppet] - 10https://gerrit.wikimedia.org/r/588431 (owner: 10CDanis) [15:37:10] (03PS3) 10CDanis: WIP: add NIC saturation exporter [puppet] - 10https://gerrit.wikimedia.org/r/588431 [15:39:10] (03CR) 10Hashar: "That is one is for the zuul-merger process which does ssh operations with Gerrit and thus require the known-hosts. 
It seems that previous" [puppet] - 10https://gerrit.wikimedia.org/r/588708 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [15:41:00] 10Operations, 10Pybal, 10Traffic: pybal-related issue on host start can break service IPs... - https://phabricator.wikimedia.org/T113597 (10fgiunchedi) p:05High→03Medium We've been routinely reboot lvs hosts multiple times and IIRC this issue hasn't come up again (?) Lowering priority [15:41:50] rlazarus: since I forgot to say it on gerrit proper, thanks for the review, very helpful :) [15:42:23] (03PS1) 10Hnowlan: changeprop: increase replicas and resources assigned to changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/588721 (https://phabricator.wikimedia.org/T248677) [15:42:40] hashar: i wonder why that gerrit host key is not in contint1001 ssh_known_hosts either though [15:42:53] there is one, but not the rsa one that is on cloud clients [15:43:52] (03CR) 10Ppchelko: changeprop: increase replicas and resources assigned to changeprop (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/588721 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [15:44:01] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (10Samwalton9) [15:44:26] mutante: ah tehre are two keys [15:44:47] mutante: one that is automagically collected by puppet is for ssh to the server (port 22) [15:45:01] the other is for the Gerrit application on port 29418, and that one is unknown to puppet [15:45:01] (03PS2) 10Hnowlan: changeprop: increase replicas and resources assigned to changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/588721 (https://phabricator.wikimedia.org/T248677) [15:45:44] (03CR) 10Ppchelko: [C: 03+2] changeprop: increase replicas and resources assigned to changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/588721 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [15:46:01] 
(03Merged) 10jenkins-bot: changeprop: increase replicas and resources assigned to changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/588721 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [15:46:16] hashar: yea, i realize it's for the other ssh server, but looking at the sshkey puppet resource as you use it in cloud.. it will install by default into /etc/ssh/ssh_known_hosts and it's just not there on contint1001 [15:46:34] 10Operations, 10Analytics: Broken /a/refinery-source/guard/run_all_guards.sh script on stat1002 - https://phabricator.wikimedia.org/T166937 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Boldly resolving, the class has been removed from puppet in I830a80fd7eb [15:46:36] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10fgiunchedi) [15:46:38] and right now it also works on 1001 [15:47:12] ah the etc file must be generated by a class that is not in base [15:47:13] grbb [15:47:24] oh [15:47:32] [contint1001:/etc/ssh] $ grep gerrit ssh_known_hosts [15:47:32] /etc/ssh/ssh_known_hosts :) [15:47:40] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:03] hashar: sshkey{} comes with puppet [15:48:40] By default, this type will install keys into /etc/ssh/ssh_known_hosts. To manage ssh keys in a different known_hosts file, such as a user’s personal known_hosts, pass its path to the target parameter. 
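A hedged aside on the "two keys" point above (one host key for sshd on port 22, another for the Gerrit application on port 29418): in known_hosts files a non-default port is recorded as "[host]:port", so a plain grep for the hostname can find one entry and miss the absence of the other. The Python sketch below lists the key types stored for a given host and port. It is illustrative only: the function names are invented here, and it handles plain entries only, not hashed hostnames, wildcards, or negated patterns.

```python
def host_pattern(host, port=22):
    """known_hosts name for a host: bare for port 22, "[host]:port" otherwise."""
    return host if port == 22 else f"[{host}]:{port}"


def key_types_for(known_hosts_text, host, port=22):
    """Return the key types recorded for host:port in known_hosts content."""
    wanted = host_pattern(host, port)
    found = []
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0].startswith("@"):
            # skip a leading @cert-authority / @revoked marker
            fields = fields[1:]
        if len(fields) < 3:
            continue
        # the first field is a comma-separated list of names and addresses
        if wanted in fields[0].split(","):
            found.append(fields[1])
    return found
```

Run against the contents of /etc/ssh/ssh_known_hosts and of a per-user file such as /var/lib/zuul/.ssh/known_hosts, this would show which file, if any, holds the port 29418 entry, and with which key type.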
[15:48:55] (03PS1) 10Ema: Revert "cache: test purged on cp3050" [puppet] - 10https://gerrit.wikimedia.org/r/588722 (https://phabricator.wikimedia.org/T249583) [15:49:06] !log 1.35.0-wmf.28 was branched at ded5b87df12cea88d94dde0fa22cac13227f8e92 for T247775 [15:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:13] T247775: 1.35.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T247775 [15:49:21] but there is no target => in the existing code on slave::labs [15:49:59] 10Operations, 10Pybal, 10Traffic: pybal: race condition in alerts instrumentation - https://phabricator.wikimedia.org/T176388 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi AFAICT this issue hasn't reoccurred, boldly resolving [15:49:59] yeah and my patch will add it to the global known hosts [15:50:11] the devil is: will it be added on all the other hosts as well [15:50:12] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:50:13] (03CR) 10Ema: [C: 03+2] Revert "cache: test purged on cp3050" [puppet] - 10https://gerrit.wikimedia.org/r/588722 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [15:50:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: increase replicas and resources assigned to changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/588721 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [15:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:31] hashar: How does it work on contint1001 right now if that key is missing? 
[15:50:32] if so, I would move the sshkey definition to the gerrit server class [15:51:01] as for why it works right now, the pub key has been added to /var/lib/zuul/.ssh/known_hosts , I guess that has been done manually [15:51:10] i haven't found any trace in git about it [15:51:18] 10Operations, 10Puppet, 10User-Joe: Passenger spews Exception NoMethodError in Rack application object - https://phabricator.wikimedia.org/T180944 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Looks like this isn't an issue anymore, resolving [15:51:21] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254 (10fgiunchedi) [15:51:28] hashar: ok, so that means you would want to use target => to use it in zuul's homedir [15:51:38] hashar: then the question is how it works on all the cloud clients :) [15:52:09] we can add it in a target indeed [15:52:36] !log cp3050: suspend purged testing, varnish-frontend-restart to clear mailbox lag T249583 [15:52:36] on the cloud clients, is it also in /var/lib/zuul/ ? 
[15:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:43] T249583: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 [15:52:44] but looks like having it in the global host is fine as well [15:53:19] for cloud, well, the key is in the global known host :] [15:53:29] so that works for any user on the instance [15:54:23] ack, ok [15:54:54] the thing I don't know is whether all sshkey are collected [15:55:00] and then dispatched on every servers [15:55:24] if so, it might be better to define the sshkey resource on the Gerrit role/profile [15:55:36] which puppet running on gerrit1001 will collect and then use it to update every single hosts [15:57:22] (03PS1) 10Filippo Giunchedi: utils: fix hiera Debian package name [puppet] - 10https://gerrit.wikimedia.org/r/588729 [16:00:04] godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200414T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:39] hashar: i think that would happen if you use @@sshkey as opposed to just sshkey [16:01:29] i would be fine with doing it as it is but use target => to make it specific for the zuul user. keep copying how it is on contint1001 [16:01:37] should we try using @@sshkey on the Gerrit host? [16:02:15] what would be the advantage of changing things vs the prod server? [16:02:31] don't know :) in both cases I got refactor the puppet change anyway hehe [16:03:25] (03CR) 10RLazarus: [C: 03+1] "LGTM! If this is still WIP I'm happy to review again, but don't need to." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/588431 (owner: 10CDanis) [16:03:33] 10Operations, 10Puppet, 10puppet-compiler: Puppet compiler failure to lookup some keys - https://phabricator.wikimedia.org/T185215 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Things work as expected nowadays: ` $ ./utils/hiera_lookup -v --fqdn=ganeti1001.eqiad.wmnet --roles=ganeti profile::ganeti:... [16:03:38] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade hiera to stretch (version 3) - https://phabricator.wikimedia.org/T188623 (10fgiunchedi) [16:03:40] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562 (10fgiunchedi) [16:04:05] (03CR) 10ArielGlenn: [C: 03+2] fix listing of input files for 7z recompression [dumps] - 10https://gerrit.wikimedia.org/r/588393 (https://phabricator.wikimedia.org/T250018) (owner: 10ArielGlenn) [16:04:44] !log ariel@deploy1001 Started deploy [dumps/dumps@90cbab0]: fix listing of input files for 7z recompression [16:04:48] !log ariel@deploy1001 Finished deploy [dumps/dumps@90cbab0]: fix listing of input files for 7z recompression (duration: 00m 04s) [16:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:53] (03PS4) 10CDanis: Add NIC saturation exporter (Python implementation) [puppet] - 10https://gerrit.wikimedia.org/r/588431 [16:04:57] hashar: exporting is in modules/ssh/manifests/server for the FQDN. 
not sure where the collecting part is yet [16:05:28] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@baf0a4b]: Rollback removing k8s rules [16:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:44] hmm let's go with the target => first [16:05:52] (03PS5) 10CDanis: Add NIC saturation exporter (Python implementation) [puppet] - 10https://gerrit.wikimedia.org/r/588431 (https://phabricator.wikimedia.org/T224454) [16:06:06] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog: Map tile generation error - https://phabricator.wikimedia.org/T215120 (10fgiunchedi) p:05High→03Medium Lowering severity, not an emergency indeed [16:06:48] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@baf0a4b]: Rollback removing k8s rules (duration: 01m 20s) [16:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:57] !log ariel@deploy1001 Started deploy [dumps/dumps@90cbab0]: fix listing of input files for 7z recompression, retry [16:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:01] !log ariel@deploy1001 Finished deploy [dumps/dumps@90cbab0]: fix listing of input files for 7z recompression, retry (duration: 00m 04s) [16:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:02] PROBLEM - ganeti-mond running on ganeti2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [16:08:08] (03CR) 10Muehlenhoff: utils: fix hiera Debian package name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/588729 (owner: 10Filippo Giunchedi) [16:09:25] akosiaris: i saw that alert but also that you are on the shell ^ [16:09:57] mutante: I will refactor it tonight :) in 1/1 with Tyler now then will have dinner [16:10:45] hashar: sounds good. 
thanks [16:10:46] 10Operations, 10observability, 10serviceops, 10Patch-For-Review, and 2 others: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10CDanis) The patch now posted here is a reasonably-clean Python implementation of the same idea described in my now-long-ago comment at T2... [16:11:03] 10Operations, 10SRE-swift-storage, 10User-fgiunchedi: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi From the dashboard above, doesn't look like this is reoccurring (?) resolving [16:11:14] 10Operations, 10Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986 (10MoritzMuehlenhoff) 05Resolved→03Open I think we've still seen these on Stretch, let's keep this open until we're sure that this got fixed in Buster. [16:11:20] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:25] 10Operations, 10Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986 (10MoritzMuehlenhoff) [16:13:08] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:32] RECOVERY - ganeti-mond running on ganeti2001 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [16:14:30] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] "Looks good, thanks!" 
[debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/588718 (owner: 10Kormat) [16:17:47] (03PS1) 10Cmjohnson: Adding production dns ipv4 & ipv6 for cloudcontrol1005 [dns] - 10https://gerrit.wikimedia.org/r/588731 (https://phabricator.wikimedia.org/T247471) [16:18:13] (03CR) 10jerkins-bot: [V: 04-1] Adding production dns ipv4 & ipv6 for cloudcontrol1005 [dns] - 10https://gerrit.wikimedia.org/r/588731 (https://phabricator.wikimedia.org/T247471) (owner: 10Cmjohnson) [16:20:36] !log Scap cleaning 1.35.0-wmf.25 T247775 [16:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:44] T247775: 1.35.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T247775 [16:22:36] (03CR) 10Dzahn: Adding production dns ipv4 & ipv6 for cloudcontrol1005 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/588731 (https://phabricator.wikimedia.org/T247471) (owner: 10Cmjohnson) [16:23:56] 10Operations, 10netops: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 (10Volans) The structure looks good to me, we could optionally skip the duplicate `import_policy` and `export_policy` if we don't have cases of override, but it's fine. [16:24:45] (03PS2) 10Cmjohnson: Adding production dns ipv4 & ipv6 for cloudcontrol1005 [dns] - 10https://gerrit.wikimedia.org/r/588731 (https://phabricator.wikimedia.org/T247471) [16:24:52] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:25:04] thanks dzahn [16:25:09] mutante [16:26:04] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:41] (03CR) 10Dzahn: [C: 03+1] Adding production dns ipv4 & ipv6 for cloudcontrol1005 [dns] - 10https://gerrit.wikimedia.org/r/588731 (https://phabricator.wikimedia.org/T247471) (owner: 10Cmjohnson) [16:26:47] cmjohnson1: you're welcome. looks good [16:26:51] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds like a good idea!" [puppet] - 10https://gerrit.wikimedia.org/r/587985 (https://phabricator.wikimedia.org/T247650) (owner: 10Dzahn) [16:28:53] 10Operations, 10observability: production-logstash elastic cluster is yellow state - https://phabricator.wikimedia.org/T250133 (10herron) +1 for option 1 as a non-destructive first step to regain space [16:31:53] (03PS1) 10Jbond: pcc templates: refactor templates to make them more DRY [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/588735 (https://phabricator.wikimedia.org/T250169) [16:31:57] (03PS1) 10Jbond: pcc templates: add cli instructions to template footer [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/588736 (https://phabricator.wikimedia.org/T250169) [16:33:00] (03PS3) 10Cmjohnson: Adding production dns ipv4 & ipv6 for cloudcontrol1005 [dns] - 10https://gerrit.wikimedia.org/r/588731 (https://phabricator.wikimedia.org/T247471) [16:33:41] (03PS1) 10Herron: logstash: reduce replica count to 0 after 80 days [puppet] - 10https://gerrit.wikimedia.org/r/588740 (https://phabricator.wikimedia.org/T250133) [16:35:51] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns ipv4 & ipv6 for cloudcontrol1005 [dns] - 10https://gerrit.wikimedia.org/r/588731 (https://phabricator.wikimedia.org/T247471) (owner: 10Cmjohnson) [16:36:02] (03PS2) 10Jbond: cli: add pcc invocation to logging output [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/588703 (https://phabricator.wikimedia.org/T250169) [16:37:48] (03CR) 10Filippo Giunchedi: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/588740 (https://phabricator.wikimedia.org/T250133) 
(owner: 10Herron) [16:37:52] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: reduce replica count to 0 after 80 days [puppet] - 10https://gerrit.wikimedia.org/r/588740 (https://phabricator.wikimedia.org/T250133) (owner: 10Herron) [16:37:58] !log jforrester@deploy1001 Pruned MediaWiki: 1.35.0-wmf.25 (duration: 17m 20s) [16:38:02] !log stop all ganeti components (VMs are fine) on all ganeti2* hosts for key/cert rollover [16:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:02] (03PS2) 10Herron: logstash: reduce replica count to 0 after 80 days [puppet] - 10https://gerrit.wikimedia.org/r/588740 (https://phabricator.wikimedia.org/T250133) [16:42:30] (03PS1) 10Cmjohnson: Updating dhcp and netboot.cfg files for cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/588743 (https://phabricator.wikimedia.org/T247471) [16:42:52] (03CR) 10Herron: "> LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/588740 (https://phabricator.wikimedia.org/T250133) (owner: 10Herron) [16:45:43] (03PS8) 10Cwhite: smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) [16:45:56] (03PS1) 10Jforrester: testwikis wikis to 1.35.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588745 [16:45:58] (03CR) 10Jforrester: [C: 03+2] testwikis wikis to 1.35.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588745 (owner: 10Jforrester) [16:47:32] (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588745 (owner: 10Jforrester) [16:49:14] !log jforrester@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.28 [16:49:18] !log jforrester@deploy1001 sync aborted: testwikis wikis to 1.35.0-wmf.28 (duration: 00m 05s) [16:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:08] Hmm. [16:51:37] (03CR) 10jerkins-bot: [V: 04-1] smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [16:52:22] !log jforrester@deploy1001 Started scap: Testwikis to php-1.35.0-wmf.28 and rebuild i18n cache for T247775 [16:52:23] (03PS1) 10Hnowlan: changeprop: make kafka SSL configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/588749 (https://phabricator.wikimedia.org/T248677) [16:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:28] T247775: 1.35.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T247775 [16:54:42] PROBLEM - ganeti-mond running on ganeti2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [16:54:53] (03PS9) 10Cwhite: smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) [16:55:14] PROBLEM - ganeti-confd running on ganeti2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [16:55:40] 10Operations, 10Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986 (10jcrespo) Also this would need to be reverted: https://phabricator.wikimedia.org/T148986#3850836 before closing the ticket, but I don't think we have a proxy with bu... 
[16:56:34] RECOVERY - ganeti-mond running on ganeti2007 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [16:56:57] (03CR) 10Ppchelko: changeprop: make kafka SSL configurable (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/588749 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [16:57:04] RECOVERY - ganeti-confd running on ganeti2001 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [16:58:18] (03CR) 10Ppchelko: [C: 03+2] changeprop: make kafka SSL configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/588749 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [16:58:36] (03Merged) 10jenkins-bot: changeprop: make kafka SSL configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/588749 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [16:58:40] (03CR) 10Ppchelko: changeprop: make kafka SSL configurable (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/588749 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [16:59:42] (03CR) 10CDanis: [C: 03+2] Add NIC saturation exporter (Python implementation) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/588431 (https://phabricator.wikimedia.org/T224454) (owner: 10CDanis) [16:59:44] mutante: ok, fixed now. Adding info to wikitech [17:00:04] halfak and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200414T1700). 
[17:02:51] (03CR) 10Cwhite: [C: 03+2] smart: make smart_data_dump importable for adding tests [puppet] - 10https://gerrit.wikimedia.org/r/587795 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [17:04:41] (03CR) 10Cwhite: "added tox changes so that tests are run in integration" [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [17:05:31] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [17:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:45] (03PS2) 10Cwhite: smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - 10https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) [17:06:28] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:29] (03PS5) 10Cwhite: smart: abstract parsing from data gathering and add tests [puppet] - 10https://gerrit.wikimedia.org/r/587816 (https://phabricator.wikimedia.org/T199236) [17:06:45] (03PS3) 10Cwhite: smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - 10https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) [17:06:48] (03CR) 10jerkins-bot: [V: 04-1] smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - 10https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [17:06:55] (03PS2) 10Cwhite: smart: simplify PD [puppet] - 10https://gerrit.wikimedia.org/r/588515 (https://phabricator.wikimedia.org/T199236) [17:06:56] akosiaris: i presume that was about ganeti? 
thanks [17:07:16] (03CR) 10jerkins-bot: [V: 04-1] smart: abstract parsing from data gathering and add tests [puppet] - 10https://gerrit.wikimedia.org/r/587816 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [17:07:18] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [17:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:57] 10Operations, 10MediaWiki-Cache, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, and 3 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (10CDanis) [17:08:00] (03CR) 10jerkins-bot: [V: 04-1] smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - 10https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [17:08:14] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:08:25] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[17:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:54] (03CR) 10jerkins-bot: [V: 04-1] smart: simplify PD [puppet] - 10https://gerrit.wikimedia.org/r/588515 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [17:09:20] (03CR) 10Cmjohnson: [C: 03+2] Updating dhcp and netboot.cfg files for cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/588743 (https://phabricator.wikimedia.org/T247471) (owner: 10Cmjohnson) [17:09:36] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01146 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:10:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10Cmjohnson) [17:10:42] (03PS1) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) [17:11:08] (03PS1) 10Elukey: Set 10G for the JVM's young gen size to cloudelastic-chi [puppet] - 10https://gerrit.wikimedia.org/r/588753 (https://phabricator.wikimedia.org/T231517) [17:11:22] chaomodus: yes. 
Adding entry to wikitech right now about it [17:12:14] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@354ae2d]: Remove rules enabled in k8s T248677 attempt 2 [17:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:22] T248677: Finalise changeprop migration to k8s - https://phabricator.wikimedia.org/T248677 [17:12:40] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@354ae2d]: Remove rules enabled in k8s T248677 attempt 2 (duration: 00m 25s) [17:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:57] (03PS2) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) [17:14:27] (03PS4) 10Cwhite: smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - 10https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) [17:15:52] (03CR) 10jerkins-bot: [V: 04-1] smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - 10https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [17:15:54] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/21921/" [puppet] - 10https://gerrit.wikimedia.org/r/588753 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [17:16:56] (03CR) 10jerkins-bot: [V: 04-1] Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:19:01] (03PS6) 10Cwhite: smart: abstract parsing from data gathering and add tests [puppet] - 10https://gerrit.wikimedia.org/r/587816 (https://phabricator.wikimedia.org/T199236) [17:19:03] (03PS5) 10Cwhite: smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - 10https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) 
[17:19:05] (03PS3) 10Cwhite: smart: simplify PD [puppet] - 10https://gerrit.wikimedia.org/r/588515 (https://phabricator.wikimedia.org/T199236) [17:21:20] (03PS10) 10Cwhite: smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) [17:21:30] (03CR) 10jerkins-bot: [V: 04-1] smart: simplify PD [puppet] - 10https://gerrit.wikimedia.org/r/588515 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [17:21:39] (03PS7) 10Cwhite: smart: abstract parsing from data gathering and add tests [puppet] - 10https://gerrit.wikimedia.org/r/587816 (https://phabricator.wikimedia.org/T199236) [17:21:50] (03PS6) 10Cwhite: smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - 10https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) [17:21:59] (03PS4) 10Cwhite: smart: simplify PD [puppet] - 10https://gerrit.wikimedia.org/r/588515 (https://phabricator.wikimedia.org/T199236) [17:22:14] (03PS3) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) [17:23:16] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@354ae2d]: Rollback removing k8s rules, again [17:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:20] !log ppchelko@deploy1001 deploy aborted: Rollback removing k8s rules, again (duration: 00m 05s) [17:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:47] (03PS1) 10EBernhardson: cirrus: redirect more_like to codfw to rebuilt query cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588754 [17:24:32] 10Operations, 10SRE-tools, 10Traffic, 10Continuous-Integration-Config, and 4 others: Integrate automated DNS snippets into CI - https://phabricator.wikimedia.org/T243362 (10crusnov) 05Open→03Resolved This has been complete for some 
time. [17:24:37] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10crusnov) [17:24:43] (03PS2) 10EBernhardson: cirrus: redirect more_like to codfw to rebuild query cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588754 [17:25:48] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@baf0a4b]: Rollback removing k8s rules, again [17:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:44] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@baf0a4b]: Rollback removing k8s rules, again (duration: 00m 56s) [17:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:51] (03PS4) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) [17:27:01] 10Operations, 10Core Platform Team, 10Performance-Team, 10Traffic, 10serviceops: Reduce rate of purges emitted by Mediawiki - https://phabricator.wikimedia.org/T250205 (10Joe) [17:28:18] (03CR) 10EBernhardson: "This needs to be deployed before wmf.27 rolls forward" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588754 (owner: 10EBernhardson) [17:29:05] (03PS5) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) [17:29:18] 10Operations, 10ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (10Eevans) >>! In T250050#6055022, @Eevans wrote: > > [ ... ] > > Once complete we'll need to do the `b` & `c` instances as well. Once //that// is complete, we can either re-image the node entirely, or replac... 
[17:29:31] (03PS3) 10EBernhardson: cirrus: redirect more_like to codfw to rebuild query cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588754 [17:29:56] (03PS1) 10Cwhite: smart: move metrics registry and metrics init to global [puppet] - 10https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) [17:33:13] (03CR) 10jerkins-bot: [V: 04-1] Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:34:59] !log jforrester@deploy1001 Finished scap: Testwikis to php-1.35.0-wmf.28 and rebuild i18n cache for T247775 (duration: 42m 37s) [17:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:06] T247775: 1.35.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T247775 [17:35:57] Finally. [17:36:33] chaomodus: https://wikitech.wikimedia.org/wiki/Ganeti#Cluster_certificates. cluster cert had expired in codfw [17:36:45] ah! 
cool thanks [17:36:49] (03PS6) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) [17:37:02] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.0006365 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:37:56] (03PS1) 10CDanis: puppetize nic_saturation_exporter & run on memcache hosts [puppet] - 10https://gerrit.wikimedia.org/r/588760 (https://phabricator.wikimedia.org/T224454) [17:40:08] (03CR) 10EBernhardson: [C: 03+1] Set 10G for the JVM's young gen size to cloudelastic-chi [puppet] - 10https://gerrit.wikimedia.org/r/588753 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [17:40:29] (03CR) 10jerkins-bot: [V: 04-1] Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [17:40:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcontrol1005.wikime... [17:41:06] 10Operations, 10netbox, 10Patch-For-Review: Setup Swift Storage for Netbox image (was: netbox won't allow me to upload photos of the rack) - https://phabricator.wikimedia.org/T209182 (10crusnov) So the task for this is: [ ] Automate downloading of swift container contents on Netbox frontends [ ] Setup bacul... 
[17:41:38] (03PS2) 10CDanis: puppetize nic_saturation_exporter & run on memcache hosts [puppet] - 10https://gerrit.wikimedia.org/r/588760 (https://phabricator.wikimedia.org/T224454) [17:42:54] (03PS7) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) [17:43:02] (03PS3) 10CDanis: puppetize nic_saturation_exporter & run on memcache hosts [puppet] - 10https://gerrit.wikimedia.org/r/588760 (https://phabricator.wikimedia.org/T224454) [17:45:18] (03CR) 10CDanis: "PCC lgtm: https://puppet-compiler.wmflabs.org/compiler1003/21927/" [puppet] - 10https://gerrit.wikimedia.org/r/588760 (https://phabricator.wikimedia.org/T224454) (owner: 10CDanis) [17:51:50] 10Operations, 10netbox: Netbox racks consistency report - https://phabricator.wikimedia.org/T212878 (10crusnov) I gather there is some subset of the report that is reasonable? If that's the case, can we call that out. As far as I can glean, it's some measure of consistency in the front-backness of MSW and ASW... [17:52:03] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/588387 (owner: 10Jbond) [17:53:09] 10Operations, 10SRE-tools, 10netbox, 10Patch-For-Review: Cumin: add backend for Netbox - https://phabricator.wikimedia.org/T205900 (10crusnov) This is merely waiting for FQDN information to be stored in Netbox which is an 80% WIP. [17:54:21] 10Operations, 10netbox: Error in postgres puppettization for new installation (was Netbox: postgres cannot be restarted w/ current config) - https://phabricator.wikimedia.org/T184634 (10crusnov) I believe this is fixed. 
[17:55:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcontrol1005.wikimedia.org'] ` Of which those **FAILED**: ` ['cloudco... [17:57:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:01] ebernhardson: Need to deploy that more_like change now?> [17:59:40] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [17:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200414T1800) [18:00:19] ebernhardson: If so, go for it; train is all ready and raring to go. [18:04:17] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10faidon) So breaking down the (very reasonable!) ask, I think there are afew different things at play here: * Access to iDRAC/iLO so that John can e.g. look at HW status... 
[18:05:32] 10Operations, 10Wikimedia-Mailing-lists: Add admins to mailing list engineering@ - https://phabricator.wikimedia.org/T204393 (10Quiddity) a:05Quiddity→03Dzahn (Just fixing who resolved it, for the record) [18:08:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10Cmjohnson) [18:11:53] (03PS1) 10Cmjohnson: Adding new cloudcontrol1005 as role insetup to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/588766 (https://phabricator.wikimedia.org/T247471) [18:14:28] (03PS2) 10Cmjohnson: Adding new cloudcontrol1005 as role insetup to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/588766 (https://phabricator.wikimedia.org/T247471) [18:14:53] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10MoritzMuehlenhoff) >>! In T249916#6056390, @faidon wrote: > So breaking down the (very reasonable!) ask, I think there are afew different things at play here: > * Acces... [18:21:27] (03CR) 10Cmjohnson: [C: 03+2] Adding new cloudcontrol1005 as role insetup to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/588766 (https://phabricator.wikimedia.org/T247471) (owner: 10Cmjohnson) [18:22:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` c... 
[18:38:52] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:28] (03CR) 10Herron: [C: 03+2] logstash: reduce replica count to 0 after 80 days [puppet] - 10https://gerrit.wikimedia.org/r/588740 (https://phabricator.wikimedia.org/T250133) (owner: 10Herron) [18:45:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcontrol1005.wikimedia.org'] ` and were **ALL**... [18:54:23] i need to ship an easy-ish config patch before the train rolls, suspect the train moving forward is going to invalidate a cache and need to prepare [18:54:56] (03CR) 10Herron: "Please see a few thoughts inline from a cursory look. Is there a testing env stood up yet where this config could be validated along with" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/588425 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [18:54:58] (i suppose its not strictly needed first since its only group 0, but seems should ship it) [18:57:38] (03CR) 10EBernhardson: [C: 03+2] cirrus: redirect more_like to codfw to rebuild query cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588754 (owner: 10EBernhardson) [18:58:03] ebernhardson: Very well. 
[18:58:05] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:58:06] (03PS1) 10Cwhite: smart: add multiple hpsa controller support [puppet] - 10https://gerrit.wikimedia.org/r/588769 (https://phabricator.wikimedia.org/T199236) [18:58:09] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:58:39] (03Merged) 10jenkins-bot: cirrus: redirect more_like to codfw to rebuild query cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588754 (owner: 10EBernhardson) [18:59:21] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:59:50] 10Operations, 10DC-Ops: determine/process/document bios firmware tracking/updating policies - https://phabricator.wikimedia.org/T141128 (10wiki_willy) Update - per Dell, there's up to a 30-day delay with the factory approved bios/firmware upgrades from the time that they're posted on the web. So some of the r... [19:00:04] James_F and liw: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200414T1900). 
[19:00:32] !log ebernhardson@deploy1001 Started scap: wmf-config/PoolCounterSettings.php cirrus: increase pool counter size for traffic shift to codfw [19:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:43] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:00:53] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:01:54] 10Operations, 10netbox: Netbox racks consistency report - https://phabricator.wikimedia.org/T212878 (10faidon) 05Open→03Declined I think the original intention of this will be addressed by periodic audits that we'll eventually do. I'll decline this for the reasons I mentioned above, but if anyone feels str... [19:02:30] ebernhardson: Clear? [19:02:37] Oh, no. [19:02:54] James_F: almost...my memory failed and i ran sync instead of sync-file, so it'll take a couple extra minutes [19:03:10] More than a "couple". [19:03:22] It'll take ages. [19:03:23] hmm, last time i ran it it was < 10m? i suppose its been awhile [19:03:43] ebernhardson: You're 64 minutes into the "thou must not deploy" window. [19:03:44] 10Operations, 10netbox: Netbox racks consistency report - https://phabricator.wikimedia.org/T212878 (10crusnov) Thanks! [19:04:05] I asked over an hour ago if you needed to deploy this patch. :-) [19:04:32] James_F: doh, i missed that. [19:04:38] Clearly. :-) [19:04:46] scap sync can take up to 2 hours. 
[19:04:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [19:04:57] youch, umm :S [19:04:59] Though at this point it should "only" take 20 mins or so. [19:05:24] We really should put in a "are you really damn sure?" prompt for scap sync. [19:05:44] Or rename it to `scap sync-world`. [19:06:06] sync-world would make sense to me [19:06:28] this isn't the first time i've accidently run sync instead of sync-file :( [19:06:31] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:06:53] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:07:04] (03PS8) 10Cwhite: smart: abstract parsing from data gathering and add tests [puppet] - 10https://gerrit.wikimedia.org/r/587816 (https://phabricator.wikimedia.org/T199236) [19:08:01] (03PS7) 10Cwhite: smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - 10https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) [19:09:08] ebernhardson: Filed as T250223. 
[19:09:08] T250223: Rename `scap sync` to `scap sync-world` and prompt before proceeding - https://phabricator.wikimedia.org/T250223 [19:10:06] it's on sync-pull-masters, so hopefully not much longer [19:13:19] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:15:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:15:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10Cmjohnson) [19:16:35] * James_F sighs at Commons pages causing timeouts so much. [19:16:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10Cmjohnson) 05Open→03Resolved all tasks have been completed, resolving [19:17:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:20:09] Hello, when I want to rebase my patches in https://gerrit.wikimedia.org/r/#/projects/mediawiki/extensions,dashboards/default which have "merge conflict" I get internal server error (error 500) [19:20:39] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:21:03] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:22:28] !log ebernhardson@deploy1001 Finished scap: wmf-config/PoolCounterSettings.php cirrus: increase pool counter size for traffic shift to codfw (duration: 21m 55s) [19:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:48] James_F: all yours, sorry again...was supposed to take 2 minutes :( [19:22:52] ebernhardson: For IS please just use scap-sync [19:23:05] it shouldn't need a second sync, the last one shipped both files [19:23:13] Oh, of course it did. 
[19:23:27] (03PS1) 10Jforrester: group0 wikis to 1.35.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588770 [19:23:29] (03CR) 10Jforrester: [C: 03+2] group0 wikis to 1.35.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588770 (owner: 10Jforrester) [19:24:36] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588770 (owner: 10Jforrester) [19:26:12] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.28 [19:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:32] 10Operations, 10Analytics, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Needs Product Owner Decisions), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10ovasileva) [19:27:58] 10Operations, 10Product-Analytics, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Needs Product Owner Decisions), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10ovasileva) [19:28:07] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:31:39] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:31:43] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:32:03] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:33:33] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:33:57] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:41:14] I still get internal server error 500 when I want to rebase my patches which have merge conflict in https://gerrit.wikimedia.org/r/#/projects/mediawiki/extensions,dashboards/default ..... 
[19:43:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:44:33] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:46:15] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:46:59] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:47:11] !log cdanis@cumin1001 dbctl commit (dc=all): '+weight on db1104@s8', diff saved to https://phabricator.wikimedia.org/P10974 and previous config saved to /var/cache/conftool/dbconfig/20200414-194710-cdanis.json [19:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:05] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:48:49] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:50:29] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:51:01] !log cdanis@cumin1001 dbctl commit (dc=all): 'more weight to db1104', diff saved to https://phabricator.wikimedia.org/P10975 and previous config saved to /var/cache/conftool/dbconfig/20200414-195100-cdanis.json [19:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:39] (03PS5) 10Cwhite: smart: simplify PD [puppet] - 10https://gerrit.wikimedia.org/r/588515 (https://phabricator.wikimedia.org/T199236) [19:52:25] (03PS2) 10Cwhite: smart: move metrics registry and metrics init to global [puppet] - 10https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) [19:53:51] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:54:06] 10Operations, 10DBA: move db1114 to s8 - https://phabricator.wikimedia.org/T250224 (10CDanis) [19:55:48] (03PS2) 10Cwhite: smart: add multiple hpsa controller support [puppet] - 10https://gerrit.wikimedia.org/r/588769 (https://phabricator.wikimedia.org/T199236) [19:56:01] (03PS8) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) [19:57:33] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:57:34] !log cdanis@cumin1001 dbctl commit (dc=all): '+db1111, -db1126', diff saved to https://phabricator.wikimedia.org/P10976 and previous config saved to /var/cache/conftool/dbconfig/20200414-195734-cdanis.json [19:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:58:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'reduce db1126 weight due to cpu issues', diff saved to https://phabricator.wikimedia.org/P10977 and previous config saved to 
/var/cache/conftool/dbconfig/20200414-195855-marostegui.json [19:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:01] (03PS9) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) [19:59:26] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:59:30] 10Operations, 10Core Platform Team, 10Performance-Team, 10Traffic, 10serviceops: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (10Krinkle) [19:59:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:00:25] 10Operations, 10Core Platform Team, 10Performance-Team, 10Traffic, 10serviceops: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (10Krinkle) [20:01:13] 10Operations, 10Core Platform Team, 10Traffic, 10serviceops, 10Performance-Team (Radar): Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (10Gilles) [20:02:59] (03CR) 10jerkins-bot: [V: 04-1] Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) (owner: 10Andrew Bogott) [20:04:58] (03PS10) 10Andrew Bogott: Designate: replace standalone memcached with a mcrouter cluster [puppet] - 10https://gerrit.wikimedia.org/r/588752 (https://phabricator.wikimedia.org/T249941) [20:07:14] 10Operations, 10Parsing-Team, 10TechCom, 10serviceops, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Gilles) [20:11:32] (03CR) 10Hashar: [C: 03+1] "I had missed users/uid :D" [puppet] - 10https://gerrit.wikimedia.org/r/588387 (owner: 10Jbond) [20:12:35] 10Operations, 10Core Platform Team, 10Traffic, 10serviceops, 10Performance-Team (Radar): Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (10daniel) p:05Triage→03Medium @Joe You are assigned to this ticket, is this something you are going to work on in the code?... 
[20:14:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change s8 weights', diff saved to https://phabricator.wikimedia.org/P10978 and previous config saved to /var/cache/conftool/dbconfig/20200414-201412-marostegui.json [20:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:39] !log Adding Create-Signed-Tag right to wikimedia-ui-base group for wikimedia-ui-base repo [20:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:27] !log cdanis@cumin1001 dbctl commit (dc=all): 'tweak db1111 weight yet again', diff saved to https://phabricator.wikimedia.org/P10979 and previous config saved to /var/cache/conftool/dbconfig/20200414-203426-cdanis.json [20:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:01] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10KFrancis) >>! In T249873#6054118, @fgiunchedi wrote: > Also cc @KFrancis for NDA confirmation, thanks! @fgiunchedi I confirmed with T&C, an NDA for Jim Maddock i... [20:46:23] PROBLEM - Check the last execution of mediawiki_job_parser_cache_purging on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_parser_cache_purging https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:47:19] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:48:21] 10Operations, 10Commons, 10Traffic, 10Wikimedia-General-or-Unknown: 500, Internal Server Error on Commons for images at specified size - https://phabricator.wikimedia.org/T250211 (10Reedy) [20:58:31] "Apr 14 20:43:25 mwmaint1002 mediawiki_job_parser_cache_purging[210524]: Cannot purge this kind of parser cache." 
[20:59:11] those systemd alerts from mwmaint1002 are from me migrating cronjobs to systemd timers -- among other things it means we'll start getting icinga alerts from maintenance scripts that might have been quietly broken all along [20:59:42] that last mediawiki_job_parser_cache_purging alert smells very much like that, I'll file a task [21:03:29] !log depool wdqs1006 to give it a chance to catch up on lag [21:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:03] 10Operations, 10Commons, 10Traffic, 10Wikimedia-General-or-Unknown: 500, Internal Server Error on Commons for images at specified size - https://phabricator.wikimedia.org/T250211 (10Aklapper) [21:29:18] (03CR) 10Hashar: "We will want to first disable python 3.4 which is https://gerrit.wikimedia.org/r/#/c/operations/software/keyholder/+/485706/ so namely r" [software/keyholder] - 10https://gerrit.wikimedia.org/r/485705 (owner: 10Faidon Liambotis) [21:36:25] !log pool wdqs1006, it is caught up [21:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:38] 10Operations, 10MediaWiki-Cache, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, and 3 others: cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (10QEDK) >>! In T249325#6054065, @Urbanecm wrote: > I think this should have it's priority i... [21:53:30] 10Operations, 10Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (10leila) @bmansurov let me know if my help is needed. Otherwise, I assume you're on it. 
[22:06:10] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: WANObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (10Krinkle) [22:13:09] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:15:15] 10Operations, 10MediaWiki-Maintenance-system, 10serviceops: purgeParserCache.php: Cannot purge this kind of parser cache - https://phabricator.wikimedia.org/T250231 (10RLazarus) [22:16:29] ACKNOWLEDGEMENT - Check the last execution of mediawiki_job_parser_cache_purging on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_parser_cache_purging RLazarus https://phabricator.wikimedia.org/T250231 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:24:07] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:34:23] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: WANObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (10Krinkle) [22:36:46] (03PS1) 10Jhedden: cloudvps: Add metricsinfra prometheus server [puppet] - 10https://gerrit.wikimedia.org/r/588803 (https://phabricator.wikimedia.org/T250206) [22:38:57] 10Operations, 10MediaWiki-Maintenance-system, 10serviceops: purgeParserCache.php: Cannot purge this kind of parser cache - https://phabricator.wikimedia.org/T250231 (10Krinkle) 
[22:39:06] 10Operations, 10MediaWiki-Cache, 10serviceops: purgeParserCache.php: Cannot purge this kind of parser cache - https://phabricator.wikimedia.org/T250231 (10Krinkle) [22:40:37] (03CR) 10Jhedden: "The blackbox exporter is currently not used, but I'm pretty sure we'll use it soon to monitor things like HTTP/S, TCP, ICMP" [puppet] - 10https://gerrit.wikimedia.org/r/588803 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [22:40:58] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: Add metricsinfra prometheus server [puppet] - 10https://gerrit.wikimedia.org/r/588803 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [22:42:26] (03PS2) 10Jhedden: cloudvps: Add metricsinfra prometheus server [puppet] - 10https://gerrit.wikimedia.org/r/588803 (https://phabricator.wikimedia.org/T250206) [22:45:06] 10Operations, 10MediaWiki-Cache, 10serviceops: purgeParserCache.php: Cannot purge this kind of parser cache - https://phabricator.wikimedia.org/T250231 (10Krinkle) The error message incorrect/deceptive. It was written to account for BagOStuff implementations that lack an `deleteObjectsExpiringBefore` impleme... [22:57:03] 10Operations, 10DBA, 10Patch-For-Review, 10Performance-Team (Radar), 10codfw-rollout: [RFC] improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10Krinkle) [23:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200414T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:04:56] (03PS3) 10Jhedden: cloudvps: Add metricsinfra prometheus server [puppet] - 10https://gerrit.wikimedia.org/r/588803 (https://phabricator.wikimedia.org/T250206) [23:06:30] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Handle SMART for multiple shelves and controllers - https://phabricator.wikimedia.org/T199236 (10colewhite) a:03colewhite [23:09:46] (03CR) 10Cwhite: [C: 03+1] logstash: log safepoints only when running the daemon [puppet] - 10https://gerrit.wikimedia.org/r/587705 (https://phabricator.wikimedia.org/T221052) (owner: 10Filippo Giunchedi) [23:11:02] 10Operations, 10MediaWiki-Parser, 10serviceops: purgeParserCache.php: Cannot purge this kind of parser cache - https://phabricator.wikimedia.org/T250231 (10Krinkle) [23:13:07] 10Operations, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Krinkle) [23:13:48] 10Operations, 10DBA, 10MediaWiki-Parser, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Krinkle) [23:26:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10bd808) @JHedden it looks like @Cmjohnson has our new cloudcontrol ready to go! 
[23:48:37] (03PS9) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [23:49:20] (03CR) 10jerkins-bot: [V: 04-1] customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [23:50:52] (03PS10) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [23:51:10] (03CR) 10CRusnov: "> Patch Set 8:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov)