[00:00:04] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200109T0000).
[00:00:04] MatmaRex and RoanKattouw: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:16] hi
[00:02:05] Hello
[00:02:07] I'll do the SWAT
[00:02:34] MatmaRex: Peter was OK with making me sad? :-(
[00:02:36] (CR) Catrope: [C: +2] Remove 2017 wikitext editor as default on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/561649 (owner: Bartosz Dziewoński)
[00:03:34] (Merged) jenkins-bot: Remove 2017 wikitext editor as default on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/561649 (owner: Bartosz Dziewoński)
[00:03:47] James_F: i didn't mention your feelings when we talked about it :'(
[00:03:55] Hmmph.
[00:11:20] (CR) Catrope: [C: +2] GrowthExperiments: Set newcomer tasks config title ahead of deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/562987 (https://phabricator.wikimedia.org/T233465) (owner: Catrope)
[00:14:06] (PS2) Catrope: GrowthExperiments: Set newcomer tasks config title ahead of deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/562987 (https://phabricator.wikimedia.org/T233465)
[00:14:12] (CR) Catrope: [C: +2] GrowthExperiments: Set newcomer tasks config title ahead of deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/562987 (https://phabricator.wikimedia.org/T233465) (owner: Catrope)
[00:14:14] (PS1) ArielGlenn: fix up temp stub generation for cases where we rerun part of a part [dumps] - https://gerrit.wikimedia.org/r/562995 (https://phabricator.wikimedia.org/T242209)
[00:14:16] (PS1) Bstorm: Apply black formatting and make the webservice frontend pass flake8 [software/tools-webservice] - https://gerrit.wikimedia.org/r/562996 (https://phabricator.wikimedia.org/T236202)
[00:15:16] (Merged) jenkins-bot: GrowthExperiments: Set newcomer tasks config title ahead of deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/562987 (https://phabricator.wikimedia.org/T233465) (owner: Catrope)
[00:20:29] (Abandoned) CRusnov: rotatedump: Change to overwriting the daily timestamp dump rather than hour timestamps [software/netbox-deploy] - https://gerrit.wikimedia.org/r/546241 (owner: CRusnov)
[00:33:00] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: (no-op) set config page for newcomer tasks (T233465) (duration: 01m 05s)
[00:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:08] T233465: Newcomer tasks: article configurations for topics - https://phabricator.wikimedia.org/T233465
[00:37:15] (CR) CDanis: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/562979
(https://phabricator.wikimedia.org/T240715) (owner: Jhedden)
[00:45:14] RoanKattouw: thanks for swatting btw :)
[00:47:00] Operations, ops-codfw: codfw: rack/setup/install wdqs202[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (Papaul)
[00:47:23] Operations, ops-codfw: codfw: rack/setup/install wdqs202[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (Papaul)
[00:47:54] Operations, ops-codfw: codfw: rack/setup/install wdqs202[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (Papaul)
[00:48:18] Operations, ops-codfw: codfw: rack/setup/install wdqs202[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (Papaul) p:Triage→Normal
[00:49:12] Operations, SRE-Access-Requests, Security: Please grant dsharpe temporary access to mendelevium.eqiad.wmnet - https://phabricator.wikimedia.org/T242113 (Dsharpe) The investigation is now 100% done. Please remove my (dsharpe) access from server mendelevium.eqiad.wmnet. Thank you so much!!!
[00:57:15] Operations, ops-codfw: codfw: rack/setup/install wdqs202[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (Papaul)
[00:58:34] (PS1) Bstorm: kubernetes: persist the cpu and mem args in service manifests [software/tools-webservice] - https://gerrit.wikimedia.org/r/563003 (https://phabricator.wikimedia.org/T236202)
[01:00:04] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200109T0100).
[01:01:18] (PS2) Bstorm: kubernetes: persist the cpu and mem args in service manifests [software/tools-webservice] - https://gerrit.wikimedia.org/r/563003 (https://phabricator.wikimedia.org/T236202)
[01:19:03] (PS1) Papaul: DNS:Add mgmt and production DNS for mc-gp200[1-3] [dns] - https://gerrit.wikimedia.org/r/563004
[01:30:08] Operations, ops-codfw, Patch-For-Review: (Need By: Jan 15) codfw: rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (Papaul) @elukey are we installing Buster or Stretch ?
[02:15:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={atlas_exporter,pdu_sentry4} site={codfw,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:17:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:18:37] (CR) Dzahn: [C: +1] DNS:Add mgmt and production DNS for mc-gp200[1-3] [dns] - https://gerrit.wikimedia.org/r/563004 (owner: Papaul)
[02:19:09] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.wikimedia.org - https://phabricator.wikimedia.org/T241337 (Papaul)
[02:22:05] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.wikimedia.org - https://phabricator.wikimedia.org/T241337 (Papaul)
[02:32:15] (CR) Dzahn: [C: +1] Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 (owner: Paladox)
[02:36:10] (CR) Dzahn: profile::url_downloader: Add types and switch to lookup() (1 comment) [puppet] - https://gerrit.wikimedia.org/r/562472 (owner: Muehlenhoff)
[02:43:49] (CR) Papaul: [C: +2] DNS:Add mgmt and
production DNS for mc-gp200[1-3] [dns] - https://gerrit.wikimedia.org/r/563004 (owner: Papaul)
[02:43:52] (PS1) Dzahn: hieradata/labs: url_downloader settings for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/563016
[02:44:16] (CR) Dzahn: profile::url_downloader: Add types and switch to lookup() (1 comment) [puppet] - https://gerrit.wikimedia.org/r/562472 (owner: Muehlenhoff)
[02:45:22] Operations, ops-codfw, Patch-For-Review: (Need By: Jan 15) codfw: rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (Papaul)
[02:45:35] (PS2) Dzahn: hieradata/labs: url_downloader settings for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/563016
[02:54:37] (PS1) Dzahn: site: remove phab1003, decom [puppet] - https://gerrit.wikimedia.org/r/563020 (https://phabricator.wikimedia.org/T238957)
[02:57:49] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to RESTBase for clarakosi - https://phabricator.wikimedia.org/T242152 (Dzahn) p:Triage→High
[03:03:06] Operations, LDAP-Access-Requests, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (Dzahn)
[03:04:07] Operations, LDAP-Access-Requests, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (Dzahn)
[03:08:11] Operations, LDAP-Access-Requests, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (Dzahn)
[03:11:42] Operations, LDAP-Access-Requests, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (Dzahn)
[03:16:53] Operations,
LDAP-Access-Requests, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (Dzahn) Welcome @hnowlan! This checklist is from a template for onboarding in SRE. I started by addin...
[03:18:12] Operations, LDAP-Access-Requests, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (Dzahn) p:Triage→High
[03:39:44] Operations, Discovery, Traffic, Wikidata, Wikidata-Query-Service: LDF server has 404 errors for JS and CSS resources - https://phabricator.wikimedia.org/T237165 (Dzahn) p:Triage→High
[03:41:18] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to production servers in perf-team group for dpifke - https://phabricator.wikimedia.org/T242189 (Dzahn) p:Triage→High
[05:03:36] (CR) Ayounsi: [C: +1] fastnetmon: remove UDP and ICMP limits [puppet] - https://gerrit.wikimedia.org/r/562387 (https://phabricator.wikimedia.org/T241374) (owner: CDanis)
[05:20:40] (CR) Effie Mouzeli: [C: +1] "+1 for thumbor, but theemin's /srv directory is currently empty, are we ok with this?" [puppet] - https://gerrit.wikimedia.org/r/561852 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[05:21:10] (CR) Effie Mouzeli: [C: +1] "LGTM for the rdb* hosts" [puppet] - https://gerrit.wikimedia.org/r/562778 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[05:38:34] Operations, Mail: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2 - https://phabricator.wikimedia.org/T240906 (ayounsi) * Is it always the same source Watchmouse probe failing or "random" ones? * What does the check do exactly? (TCP, more L7 checks?) * Is the check configured to...
[05:45:31] (CR) Ayounsi: [C: +1] Add port 4192 to term eventgate-analytics in analytics-in4 [homer/public] - https://gerrit.wikimedia.org/r/562842 (https://phabricator.wikimedia.org/T242224) (owner: Elukey)
[06:16:56] Operations, ops-eqiad, DBA: Degraded RAID on db1100 - https://phabricator.wikimedia.org/T241506 (Marostegui) Thanks!
[06:21:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1096:3315 db1096:3316 T239453', diff saved to https://phabricator.wikimedia.org/P10092 and previous config saved to /var/cache/conftool/dbconfig/20200109-062157-marostegui.json
[06:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:03] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453
[06:26:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2088:3312 T239453', diff saved to https://phabricator.wikimedia.org/P10093 and previous config saved to /var/cache/conftool/dbconfig/20200109-062608-marostegui.json
[06:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:03] !log Remove revision partitions from db2088:3312 T239453
[06:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:05] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453
[06:30:06] (CR) Marostegui: "There is an existing RO user called gerritro which only has SELECT grants, so you can probably use that user."
[puppet] - https://gerrit.wikimedia.org/r/562965 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn)
[07:20:31] (PS1) Marostegui: install_server: Do not reimage db1107 [puppet] - https://gerrit.wikimedia.org/r/563050
[07:21:42] (CR) Marostegui: [C: +2] install_server: Do not reimage db1107 [puppet] - https://gerrit.wikimedia.org/r/563050 (owner: Marostegui)
[07:23:18] Operations, ops-codfw: (Need By: Jan 15) codfw: rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (elukey) >>! In T239249#5788574, @Papaul wrote: > @elukey are we installing Buster or Stretch ? Buster please! :)
[07:25:55] (PS2) Elukey: turnilo: update configuration for webrequest_sampled_128 [puppet] - https://gerrit.wikimedia.org/r/562958 (https://phabricator.wikimedia.org/T240681) (owner: Joal)
[07:26:43] (CR) Marostegui: [C: +1] "> This host has been shut down today (by the decom script)" [puppet] - https://gerrit.wikimedia.org/r/552607 (https://phabricator.wikimedia.org/T238957) (owner: Dzahn)
[07:28:33] (CR) Elukey: [C: +2] turnilo: update configuration for webrequest_sampled_128 [puppet] - https://gerrit.wikimedia.org/r/562958 (https://phabricator.wikimedia.org/T240681) (owner: Joal)
[07:29:18] * elukey wants db1107 back, it is held hostage by marostegui
[07:29:25] hahaha
[07:29:44] Too late
[07:29:47] good morning :)
[07:29:53] glad that it is used!
[07:29:55] \o/
[07:30:02] db1107 already decided that we treat it better than you, it doesn't want to go back
[07:30:27] I can imagine, db1107 is definitely right
[07:31:40] It is also running the latest mariadb (10.4), so not a bad life!
[07:32:44] Operations, netops: Stale LibreNMS ports - https://phabricator.wikimedia.org/T242318 (ayounsi) p:Triage→Low
[07:37:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118', diff saved to https://phabricator.wikimedia.org/P10094 and previous config saved to /var/cache/conftool/dbconfig/20200109-073713-marostegui.json
[07:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:29] !log Upgrade db1118
[07:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:53] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:40:32] !log enable traceoptions for BFD on cr2-eqdfw - T240659
[07:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:35] T240659: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659
[07:48:10] (PS1) KartikMistry: WIP: Add config for OpusMT [deployment-charts] - https://gerrit.wikimedia.org/r/563110 (https://phabricator.wikimedia.org/T234194)
[08:05:34] Puppet, VPS-project-codesearch: Puppetize codesearch - https://phabricator.wikimedia.org/T242319 (Legoktm)
[08:11:06] (PS1) Legoktm: Initial puppetization of codesearch [puppet] - https://gerrit.wikimedia.org/r/563114 (https://phabricator.wikimedia.org/T242319)
[08:11:55] (CR) jerkins-bot: [V: -1] Initial puppetization of codesearch [puppet] - https://gerrit.wikimedia.org/r/563114 (https://phabricator.wikimedia.org/T242319) (owner: Legoktm)
[08:12:36] (CR) Legoktm: "Untested, I don't really know puppet so most of this was cobbled together via copy/paste."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/563114 (https://phabricator.wikimedia.org/T242319) (owner: Legoktm)
[08:13:12] Operations, netops: Stale LibreNMS ports - https://phabricator.wikimedia.org/T242318 (ayounsi) From @Marostegui, the list of tables that have rows with `device_id = 20`: P10095#59005
[08:14:47] (PS2) Legoktm: Initial puppetization of codesearch [puppet] - https://gerrit.wikimedia.org/r/563114 (https://phabricator.wikimedia.org/T242319)
[08:16:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1118', diff saved to https://phabricator.wikimedia.org/P10096 and previous config saved to /var/cache/conftool/dbconfig/20200109-081629-marostegui.json
[08:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:52] (PS1) Legoktm: build: Run commit-message-validator under Python 3 [puppet] - https://gerrit.wikimedia.org/r/563116
[08:18:46] Operations, Research, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (Aroraakhil) Hi @Dzahn, I am currently traveling and the internet access is not that great. Thus, I will try to log in once I am back...
[08:19:56] Operations, netops: Stale LibreNMS ports - https://phabricator.wikimedia.org/T242318 (Marostegui) If you need the exact rows just do: `select * from...` instead of `select count(*) from...` Let me know if you need further help.
[08:22:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1118', diff saved to https://phabricator.wikimedia.org/P10097 and previous config saved to /var/cache/conftool/dbconfig/20200109-082243-marostegui.json
[08:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:28:15] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:28:41] (CR) Gilles: [C: -1] admins: add Dave Pifke to perf-team admins (1 comment) [puppet] - https://gerrit.wikimedia.org/r/562947 (https://phabricator.wikimedia.org/T242189) (owner: Dzahn)
[08:33:46] ACKNOWLEDGEMENT - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 Ayounsi Working on it with JTAC - T240659 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:34:30] (PS2) Gilles: admins: add Dave Pifke to perf-team admins [puppet] - https://gerrit.wikimedia.org/r/562947 (https://phabricator.wikimedia.org/T242189) (owner: Dzahn)
[08:36:22] (CR) Muehlenhoff: [C: -1] "That would cause problems for those roles which intentionally don't use base::firewall when DC ops do the base install before the actual r" [puppet] - https://gerrit.wikimedia.org/r/562856 (owner: Herron)
[08:40:42] Operations, Traffic: Provide non-canonical-redirect from every datacenter - https://phabricator.wikimedia.org/T242321 (Vgutierrez)
[08:41:06] Operations, Traffic: Provide non-canonical-redirect service from every datacenter - https://phabricator.wikimedia.org/T242321 (Vgutierrez) p:Triage→Normal
[08:52:21] (CR)
Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/20285/" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/562787 (owner: Muehlenhoff)
[08:52:53] (PS2) Muehlenhoff: Switch rdb* to standardised Partman layout [puppet] - https://gerrit.wikimedia.org/r/562778 (https://phabricator.wikimedia.org/T156955)
[08:55:59] (PS1) Vgutierrez: Add ncredir300[12] DNS records [dns] - https://gerrit.wikimedia.org/r/563127 (https://phabricator.wikimedia.org/T242321)
[08:56:01] (CR) Muehlenhoff: [C: +2] Switch rdb* to standardised Partman layout [puppet] - https://gerrit.wikimedia.org/r/562778 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[08:57:56] (CR) Filippo Giunchedi: "My understanding is that ceph will be a wmcs-specific service? If so any reason not to scrape from modules/profile/manifests/wmcs/promethe" [puppet] - https://gerrit.wikimedia.org/r/562979 (https://phabricator.wikimedia.org/T240715) (owner: Jhedden)
[08:58:35] (CR) Muehlenhoff: "theemin isn't really used for anything, the last time it was reimaged it was for Chris' tests of the fixed Grub setup on multiple disks."
[puppet] - https://gerrit.wikimedia.org/r/561852 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[09:00:08] (CR) Filippo Giunchedi: [C: +1] "LGTM, I'm interested in serviceops' folks vote too" [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[09:11:45] (PS1) Vgutierrez: ganeti: Add esams, ulsfo and eqsin clusters and rows [software/spicerack] - https://gerrit.wikimedia.org/r/563132
[09:16:07] (CR) jerkins-bot: [V: -1] ganeti: Add esams, ulsfo and eqsin clusters and rows [software/spicerack] - https://gerrit.wikimedia.org/r/563132 (owner: Vgutierrez)
[09:31:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1118', diff saved to https://phabricator.wikimedia.org/P10098 and previous config saved to /var/cache/conftool/dbconfig/20200109-093119-marostegui.json
[09:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:10] !log Deploy schema change on db1106, this will generate a bit of lag on s1 labs
[09:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:26] (CR) Addshore: [C: +1] Set useEntitySourceBasedFederation to true for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/562578 (https://phabricator.wikimedia.org/T241972) (owner: Ladsgroup)
[09:33:31] (CR) Addshore: [C: +1] Set wmgUseEntitySourceBasedFederation for test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/562504 (https://phabricator.wikimedia.org/T241973) (owner: Ladsgroup)
[09:39:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1118', diff saved to https://phabricator.wikimedia.org/P10099 and previous config saved to /var/cache/conftool/dbconfig/20200109-093946-marostegui.json
[09:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 for
upgrade', diff saved to https://phabricator.wikimedia.org/P10100 and previous config saved to /var/cache/conftool/dbconfig/20200109-094748-marostegui.json
[09:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:51] !log Upgrade db1106
[09:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:15] (PS2) Vgutierrez: ganeti: Add esams, ulsfo and eqsin clusters and rows [software/spicerack] - https://gerrit.wikimedia.org/r/563132
[09:49:18] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:52:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1106', diff saved to https://phabricator.wikimedia.org/P10101 and previous config saved to /var/cache/conftool/dbconfig/20200109-095249-marostegui.json
[09:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:48] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[09:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1106', diff saved to https://phabricator.wikimedia.org/P10102 and previous config saved to /var/cache/conftool/dbconfig/20200109-095433-marostegui.json
[09:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:56] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:37] (PS1) Muehlenhoff: Bump CAS to 6.1.3 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/563140
[09:59:17] (CR) jerkins-bot: [V: -1] ganeti: Add esams, ulsfo and eqsin clusters and rows [software/spicerack] - https://gerrit.wikimedia.org/r/563132 (owner: Vgutierrez)
[10:05:54] !log marostegui@cumin1001
dbctl commit (dc=all): 'Slowly repool db1106', diff saved to https://phabricator.wikimedia.org/P10103 and previous config saved to /var/cache/conftool/dbconfig/20200109-100552-marostegui.json
[10:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:17] Operations, Wikimedia-Mailing-lists, Release-Engineering-Team-TODO (201911), User-zeljkofilipin: Close QA mailing list - https://phabricator.wikimedia.org/T237383 (zeljkofilipin) Resolved→Open
[10:09:05] Operations, Wikimedia-Mailing-lists, Release-Engineering-Team-TODO (201911), User-zeljkofilipin: Close QA mailing list - https://phabricator.wikimedia.org/T237383 (zeljkofilipin) Looks like the list is not closed: https://lists.wikimedia.org/pipermail/qa/
[10:09:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1106', diff saved to https://phabricator.wikimedia.org/P10104 and previous config saved to /var/cache/conftool/dbconfig/20200109-100948-marostegui.json
[10:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:38] (CR) Ema: [C: +1] Add ncredir300[12] DNS records [dns] - https://gerrit.wikimedia.org/r/563127 (https://phabricator.wikimedia.org/T242321) (owner: Vgutierrez)
[10:16:33] (PS3) Vgutierrez: ganeti: Add esams, ulsfo and eqsin clusters and rows [software/spicerack] - https://gerrit.wikimedia.org/r/563132
[10:21:16] (CR) jerkins-bot: [V: -1] ganeti: Add esams, ulsfo and eqsin clusters and rows [software/spicerack] - https://gerrit.wikimedia.org/r/563132 (owner: Vgutierrez)
[10:23:48] (PS6) Jbond: apt:::pin: allow callers to override the notify resource [puppet] - https://gerrit.wikimedia.org/r/562789
[10:25:20] (PS1) Elukey: Revert "hue: add row limit threshold for hive queries" [puppet] - https://gerrit.wikimedia.org/r/563144
[10:25:33] (PS2) Elukey: Revert "hue: add row limit threshold for hive queries" [puppet] -
https://gerrit.wikimedia.org/r/563144
[10:27:26] (CR) Elukey: [C: +2] Revert "hue: add row limit threshold for hive queries" [puppet] - https://gerrit.wikimedia.org/r/563144 (owner: Elukey)
[10:28:17] (CR) Jbond: [C: +2] apt:::pin: allow callers to override the notify resource [puppet] - https://gerrit.wikimedia.org/r/562789 (owner: Jbond)
[10:29:36] Operations, Performance-Team, Traffic, observability: Ensure graphs used by Performance account for Varnish-to-ATS migration - https://phabricator.wikimedia.org/T233474 (ema) >>! In T233474#5786575, @Krinkle wrote: > @ema If I understand correctly, varnishrls does not yet require migration becaus...
[10:32:11] (PS1) Muehlenhoff: Fix coredump_dir for stretch/buster [puppet] - https://gerrit.wikimedia.org/r/563148 (https://phabricator.wikimedia.org/T224551)
[10:35:23] (PS1) Elukey: admin: add kerberos flag to user jfishback [puppet] - https://gerrit.wikimedia.org/r/563149 (https://phabricator.wikimedia.org/T242245)
[10:35:28] (CR) Vgutierrez: [C: +2] Add ncredir300[12] DNS records [dns] - https://gerrit.wikimedia.org/r/563127 (https://phabricator.wikimedia.org/T242321) (owner: Vgutierrez)
[10:38:56] (CR) Elukey: [C: +2] admin: add kerberos flag to user jfishback [puppet] - https://gerrit.wikimedia.org/r/563149 (https://phabricator.wikimedia.org/T242245) (owner: Elukey)
[10:41:24] (CR) Jbond: [C: +2] "LGTM will merge" [puppet] - https://gerrit.wikimedia.org/r/562947 (https://phabricator.wikimedia.org/T242189) (owner: Dzahn)
[10:43:39] (CR) Jbond: "Looks like Nuria still hasn't approved this one" [puppet] - https://gerrit.wikimedia.org/r/562940 (https://phabricator.wikimedia.org/T241838) (owner: Dzahn)
[10:44:11] (CR) Jbond: [C: +1] "> Patch Set 2: Code-Review+2" [puppet] - https://gerrit.wikimedia.org/r/562947 (https://phabricator.wikimedia.org/T242189) (owner: Dzahn)
[10:50:21] (CR) Alexandros Kosiaris: [C: +1]
hieradata/labs: url_downloader settings for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/563016 (owner: Dzahn)
[10:50:53] (CR) Alexandros Kosiaris: [C: -1] profile::url_downloader: Add types and switch to lookup() (1 comment) [puppet] - https://gerrit.wikimedia.org/r/562472 (owner: Muehlenhoff)
[10:52:17] (PS1) Alexandros Kosiaris: Revert "otrs: Add otrs-admins group" [puppet] - https://gerrit.wikimedia.org/r/563151 (https://phabricator.wikimedia.org/T242113)
[10:52:19] (CR) Volans: [C: +1] Don't install the Postgres contrib package on Buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/562787 (owner: Muehlenhoff)
[10:52:38] (CR) jerkins-bot: [V: -1] Revert "otrs: Add otrs-admins group" [puppet] - https://gerrit.wikimedia.org/r/563151 (https://phabricator.wikimedia.org/T242113) (owner: Alexandros Kosiaris)
[10:53:21] Operations, LDAP-Access-Requests, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (hnowlan) SSH key: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICIFfF8+3TrSBaPBKPwbmnBM7e0C9/TFHs9/2hHiq+3t no...
[10:53:46] Operations, LDAP-Access-Requests, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (hnowlan)
[10:54:01] Operations, LDAP-Access-Requests, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (hnowlan)
[10:57:36] (CR) Muehlenhoff: [C: +2] Fix coredump_dir for stretch/buster [puppet] - https://gerrit.wikimedia.org/r/563148 (https://phabricator.wikimedia.org/T224551) (owner: Muehlenhoff)
[10:59:03] (PS4) Ammarpad: Re-add localized Wikipedia wordmark for szlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/557584 (https://phabricator.wikimedia.org/T233104)
[11:03:05] (PS5) Ammarpad: Re-add localized Wikipedia wordmark for szlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/557584 (https://phabricator.wikimedia.org/T233104)
[11:03:19] Operations, LDAP-Access-Requests, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (hnowlan)
[11:04:42] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[11:04:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:03] (PS1) Muehlenhoff: Apply url downloader role to new hosts [puppet] - https://gerrit.wikimedia.org/r/563154 (https://phabricator.wikimedia.org/T224551)
[11:13:53] (CR) Muehlenhoff: [C: +2] Apply url downloader role to new hosts [puppet] - https://gerrit.wikimedia.org/r/563154 (https://phabricator.wikimedia.org/T224551) (owner: Muehlenhoff)
[11:19:28] (PS1) Elukey: admin: add kerberos flag for user mepps
[puppet] - https://gerrit.wikimedia.org/r/563158 (https://phabricator.wikimedia.org/T242222)
[11:19:30] (PS1) Elukey: admin: add kerberos flag for user snowick [puppet] - https://gerrit.wikimedia.org/r/563159 (https://phabricator.wikimedia.org/T242046)
[11:22:05] (PS2) Elukey: admin: add kerberos flag for user mepps [puppet] - https://gerrit.wikimedia.org/r/563158 (https://phabricator.wikimedia.org/T242222)
[11:22:07] (PS2) Elukey: admin: add kerberos flag for user snowick [puppet] - https://gerrit.wikimedia.org/r/563159 (https://phabricator.wikimedia.org/T242046)
[11:23:58] !log installing cyrus-sasl security updates
[11:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:49] (CR) Elukey: [C: +2] admin: add kerberos flag for user mepps [puppet] - https://gerrit.wikimedia.org/r/563158 (https://phabricator.wikimedia.org/T242222) (owner: Elukey)
[11:25:52] (PS2) Alexandros Kosiaris: Revert "otrs: Add otrs-admins group" [puppet] - https://gerrit.wikimedia.org/r/563151 (https://phabricator.wikimedia.org/T242113)
[11:26:07] (CR) Elukey: [C: +2] admin: add kerberos flag for user snowick [puppet] - https://gerrit.wikimedia.org/r/563159 (https://phabricator.wikimedia.org/T242046) (owner: Elukey)
[11:35:12] (PS3) Alexandros Kosiaris: Revert "otrs: Add otrs-admins group" [puppet] - https://gerrit.wikimedia.org/r/563151 (https://phabricator.wikimedia.org/T242113)
[11:35:30] (CR) Alexandros Kosiaris: [C: +2] Revert "otrs: Add otrs-admins group" [puppet] - https://gerrit.wikimedia.org/r/563151 (https://phabricator.wikimedia.org/T242113) (owner: Alexandros Kosiaris)
[11:35:51] Operations, Patch-For-Review: Migrate URL downloaders to Buster - https://phabricator.wikimedia.org/T224551 (MoritzMuehlenhoff) urldownloader* have been installed and are working fine in my tests; the only remaining (will do that on Monday) is to switch the CNAMEs and later remove the old jessie
instances. [11:41:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Security: Please grant dsharpe temporary access to mendelevium.eqiad.wmnet - https://phabricator.wikimedia.org/T242113 (10akosiaris) 05Open→03Resolved a:03akosiaris Access removed. Marking as resolved. [11:49:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM. The icinga checks look fine and I agree on the $ARG3$ thing. They shouldn't page even on failure anyway. Fine by me on the prometheu" [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [11:55:05] (03PS1) 10Vgutierrez: install_server,ncredir: Install ncredir3[00]12 [puppet] - 10https://gerrit.wikimedia.org/r/563161 (https://phabricator.wikimedia.org/T242321) [11:58:02] jouncebot: next [11:58:02] In 0 hour(s) and 1 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200109T1200) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200109T1200). [12:00:04] Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:16] * Urbanecm around if needed [12:00:56] o/ [12:02:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Just to be on the safe side I double checked that it isn't used as well. 
The only usage is at" [puppet] - 10https://gerrit.wikimedia.org/r/562845 (https://phabricator.wikimedia.org/T241756) (owner: 10Clarakosi) [12:05:29] I guess I can SWAT [12:05:59] oh it's only my patch, nice [12:06:25] (03CR) 10Ladsgroup: [C: 03+2] Set wmgUseEntitySourceBasedFederation for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562504 (https://phabricator.wikimedia.org/T241973) (owner: 10Ladsgroup) [12:06:33] (03PS2) 10Ladsgroup: Set wmgUseEntitySourceBasedFederation for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562504 (https://phabricator.wikimedia.org/T241973) [12:06:50] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562504 (https://phabricator.wikimedia.org/T241973) (owner: 10Ladsgroup) [12:07:18] Amir1: ping me once you're done, please [12:07:31] Sure [12:08:23] (03Merged) 10jenkins-bot: Set wmgUseEntitySourceBasedFederation for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562504 (https://phabricator.wikimedia.org/T241973) (owner: 10Ladsgroup) [12:11:16] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:562504|Set wmgUseEntitySourceBasedFederation for test.wikidata.org (T241973)]] (duration: 01m 07s) [12:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:18] T241973: wmgUseEntitySourceBasedFederation true for test.wikidata.org - https://phabricator.wikimedia.org/T241973 [12:11:35] Urbanecm: I'm done [12:11:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Shepherd the change to restbase1018 already and manually restarted restbase on that host (puppet won't do that). I see no problems." 
[puppet] - 10https://gerrit.wikimedia.org/r/562845 (https://phabricator.wikimedia.org/T241756) (owner: 10Clarakosi) [12:12:01] thanks [12:12:38] (03CR) 10Muehlenhoff: [C: 03+2] Extend Netbox Ganeti sync for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/562780 (https://phabricator.wikimedia.org/T228099) (owner: 10Muehlenhoff) [12:13:57] (03PS2) 10ArielGlenn: fix up temp stub generation for cases where we rerun part of a part [dumps] - 10https://gerrit.wikimedia.org/r/562995 (https://phabricator.wikimedia.org/T242209) [12:13:59] (03PS3) 10Urbanecm: Add ipblock-exempt and extendedconfirmed to bot group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562040 (https://phabricator.wikimedia.org/T241904) (owner: 10Ammarpad) [12:14:04] (03CR) 10Urbanecm: [C: 03+2] Add ipblock-exempt and extendedconfirmed to bot group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562040 (https://phabricator.wikimedia.org/T241904) (owner: 10Ammarpad) [12:14:15] (03CR) 10jerkins-bot: [V: 04-1] fix up temp stub generation for cases where we rerun part of a part [dumps] - 10https://gerrit.wikimedia.org/r/562995 (https://phabricator.wikimedia.org/T242209) (owner: 10ArielGlenn) [12:14:27] (03PS4) 10Urbanecm: Set $wgArticleCountMethod to 'any' for minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561572 (https://phabricator.wikimedia.org/T241694) (owner: 10Ammarpad) [12:14:59] (03Merged) 10jenkins-bot: Add ipblock-exempt and extendedconfirmed to bot group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562040 (https://phabricator.wikimedia.org/T241904) (owner: 10Ammarpad) [12:15:42] (03CR) 10Urbanecm: [C: 03+2] Set $wgArticleCountMethod to 'any' for minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561572 (https://phabricator.wikimedia.org/T241694) (owner: 10Ammarpad) [12:16:34] (03PS5) 10Urbanecm: Set $wgArticleCountMethod to 'any' for minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561572 
(https://phabricator.wikimedia.org/T241694) (owner: 10Ammarpad) [12:16:41] (03CR) 10Urbanecm: [C: 03+2] Set $wgArticleCountMethod to 'any' for minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561572 (https://phabricator.wikimedia.org/T241694) (owner: 10Ammarpad) [12:16:58] (03PS3) 10ArielGlenn: fix up temp stub generation for cases where we rerun part of a part [dumps] - 10https://gerrit.wikimedia.org/r/562995 (https://phabricator.wikimedia.org/T242209) [12:17:03] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 06394ea: Add ipblock-exempt and extendedconfirmed to bot group on fawiki (T241904) (duration: 01m 05s) [12:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:06] T241904: Add ipblock-exempt and extendedconfirmed to fawiki bot user group - https://phabricator.wikimedia.org/T241904 [12:17:31] (03CR) 10jerkins-bot: [V: 04-1] fix up temp stub generation for cases where we rerun part of a part [dumps] - 10https://gerrit.wikimedia.org/r/562995 (https://phabricator.wikimedia.org/T242209) (owner: 10ArielGlenn) [12:19:20] (03Merged) 10jenkins-bot: Set $wgArticleCountMethod to 'any' for minwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561572 (https://phabricator.wikimedia.org/T241694) (owner: 10Ammarpad) [12:19:46] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: ed0357a: Set $wgArticleCountMethod to any for minwiktionary (T241694) (duration: 01m 08s) [12:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:49] T241694: lemma count on Wiktionary Minangkabau - https://phabricator.wikimedia.org/T241694 [12:19:57] (03PS4) 10ArielGlenn: fix up temp stub generation for cases where we rerun part of a part [dumps] - 10https://gerrit.wikimedia.org/r/562995 (https://phabricator.wikimedia.org/T242209) [12:22:48] !log EU SWAT done [12:22:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:41] !log shutting down backup2001 T240177 [12:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:44] T240177: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 [12:26:27] 10Operations, 10ops-codfw: (Need By: Jan 15) codfw: rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10jijiki) [12:26:29] 10Operations, 10serviceops: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10jijiki) [12:26:45] 10Operations, 10ops-codfw, 10serviceops: (Need By: Jan 15) rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T241796 (10jijiki) [12:26:48] 10Operations, 10serviceops: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 (10jijiki) [12:28:13] 10Operations, 10DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (10jcrespo) backup2001 is now down and ready to be done maintenance (no need to ask again). @papaul please, when done, just boot it back up and ping here. Thanks. [12:31:39] 10Operations, 10ops-eqiad, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10jijiki) Hey @Jclark-ctr ! Do we have an update about when we will have those servers ready? Thank you! [12:41:43] !log Deploy schema change on s3 codfw, lag will appear on s3 codfw - T234052 [12:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:47] T234052: Add abuse_filter_log.afl_filter_id and afl_global columns - https://phabricator.wikimedia.org/T234052 [12:55:22] 10Operations, 10ops-codfw, 10SRE-swift-storage: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T241534 (10fgiunchedi) 05Open→03Resolved Thanks @Papaul ! 
Upon reboot the host booted into pxe, I am assuming because the first disk was present but was unbootable and didn't fallback onto booti... [13:23:37] (03PS7) 10Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 [13:25:02] (03PS1) 10Ema: varnish: raise severity of child restart to critical [puppet] - 10https://gerrit.wikimedia.org/r/563174 [13:25:47] (03CR) 10jerkins-bot: [V: 04-1] ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 (owner: 10Jbond) [13:27:02] (03PS2) 10Ema: varnish: raise severity of child restart to critical [puppet] - 10https://gerrit.wikimedia.org/r/563174 [13:34:06] (03PS8) 10Muehlenhoff: Don't install the Postgres contrib package on Buster [puppet] - 10https://gerrit.wikimedia.org/r/562787 [13:38:55] (03CR) 10Muehlenhoff: [C: 03+2] Don't install the Postgres contrib package on Buster [puppet] - 10https://gerrit.wikimedia.org/r/562787 (owner: 10Muehlenhoff) [13:46:03] (03PS8) 10Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 [13:46:45] (03PS9) 10Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 [13:47:50] !log upgrading mwdebug2002 to PHP 7.2.26 [13:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:14] !log upgrading mwdebug2002 to PHP 7.2.26 T241224 [13:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:17] T241224: Create PHP 7.2.26 Wikimedia package - https://phabricator.wikimedia.org/T241224 [13:48:51] (03CR) 10jerkins-bot: [V: 04-1] ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 (owner: 10Jbond) [13:50:22] (03PS10) 10Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 [13:51:12] 
(03CR) 10Jbond: [C: 03+1] "lgtm" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/563140 (owner: 10Muehlenhoff) [13:52:30] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (No Need By Date) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Gehel) [13:53:10] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/562852 (owner: 10Jbond) [14:00:04] longma and liw: Dear deployers, time to do the Mediawiki train - American+European Version (secondary timeslot) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200109T1400). [14:04:03] !log imported PHP 7.2.26 to component/php72 for stretch-wikimedia [14:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:17] 10Operations, 10SRE-Access-Requests: Requesting access to Search Platform Service for Zbyszko Papierski - https://phabricator.wikimedia.org/T242341 (10Zbyszko) [14:07:45] (03CR) 10Elukey: [C: 03+1] "Moritz: ok to merge this?" [puppet] - 10https://gerrit.wikimedia.org/r/561882 (owner: 10Muehlenhoff) [14:09:32] 10Operations, 10SRE-Access-Requests: Requesting access to Search Platform Service for Zbyszko Papierski - https://phabricator.wikimedia.org/T242341 (10Gehel) As the hiring manager for @Zbyszko, I approve this request. Since this also requests access to the analytics cluster, I'd like @Nuria to approve as well... [14:10:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078', diff saved to https://phabricator.wikimedia.org/P10105 and previous config saved to /var/cache/conftool/dbconfig/20200109-141057-marostegui.json [14:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:47] (03CR) 10Muehlenhoff: "Yes, this is good to merge. 
There were some followup fixes to apt::package_from_component, here's the updated PCC: https://puppet-compiler" [puppet] - 10https://gerrit.wikimedia.org/r/561882 (owner: 10Muehlenhoff) [14:17:40] (03CR) 10Elukey: [C: 03+2] amd_rocm: Switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/561882 (owner: 10Muehlenhoff) [14:18:25] 10Operations, 10ops-eqiad, 10DBA: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) [14:19:44] 10Operations, 10ops-codfw, 10DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Marostegui) @Papaul - reminder the RAID10 is done with 256K (the reminder is just because we recently found old some servers with RAID stripe being set... [14:20:40] 10Operations, 10ops-codfw, 10DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Marostegui) [14:27:23] !log cp3054: varnish-frontend-restart to clear things up after child crash yesterday [14:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:42] !log Upgrade db1078 [14:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:21] (03CR) 10CDanis: [C: 03+1] varnish: raise severity of child restart to critical [puppet] - 10https://gerrit.wikimedia.org/r/563174 (owner: 10Ema) [14:33:47] (03CR) 10Ema: [C: 03+2] varnish: raise severity of child restart to critical [puppet] - 10https://gerrit.wikimedia.org/r/563174 (owner: 10Ema) [14:38:24] !log upgrading Firmware on backup2001 [14:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:16] (03CR) 10Jhedden: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/562979 (https://phabricator.wikimedia.org/T240715) (owner: 10Jhedden) [14:39:20] (03Abandoned) 10Jhedden: ceph: add prometheus scrape config 
[puppet] - 10https://gerrit.wikimedia.org/r/562979 (https://phabricator.wikimedia.org/T240715) (owner: 10Jhedden) [14:39:25] PROBLEM - Varnish frontend child restarted on cp3050 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3050&var-datasource=esams+prometheus/ops [14:40:02] (03CR) 10Muehlenhoff: "Looks good, a few comments inline" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/562852 (owner: 10Jbond) [14:42:49] PROBLEM - Varnish frontend child restarted on cp4031 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4031&var-datasource=ulsfo+prometheus/ops [14:43:25] PROBLEM - Varnish frontend child restarted on cp2014 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2014&var-datasource=codfw+prometheus/ops [14:44:01] ema: all expected --^ ? [14:44:05] (just to be sure) [14:44:39] nope, what is (null) supposed to mean there? [14:46:51] godog: any idea how to debug the above? [14:46:56] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install kubernetes10[08-15].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10jijiki) [14:47:08] godog: I've moved the threshold from 3 to 1 with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/563174/ [14:48:55] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install kubernetes10[08-15].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10jijiki) [14:52:01] perhaps warning and critical cannot be the same value? 
[14:52:25] (03CR) 10Ottomata: [C: 03+1] Add port 4192 to term eventgate-analytics in analytics-in4 [homer/public] - 10https://gerrit.wikimedia.org/r/562842 (https://phabricator.wikimedia.org/T242224) (owner: 10Elukey) [14:53:54] ema: yeah a bug/limitation in check_prometheus_metric indeed, 'method' has to be true for warning and critical [14:54:02] (03PS7) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) [14:54:19] godog: so perhaps setting warning to 0.9 would fix this? [14:54:26] (03PS1) 10Papaul: DHCP: Add MAC address entries for mc-gp200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/563184 (https://phabricator.wikimedia.org/T239249) [14:54:38] sorry, 1.1 [14:55:15] (03PS1) 10Alexandros Kosiaris: nodejs10: Add buster image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/563185 (https://phabricator.wikimedia.org/T237911) [14:55:30] ema: warning at 0.9 would do it yeah [14:58:23] PROBLEM - Varnish frontend child restarted on cp2006 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2006&var-datasource=codfw+prometheus/ops [14:58:25] PROBLEM - Varnish frontend child restarted on cp3059 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3059&var-datasource=esams+prometheus/ops [14:58:58] godog: thing is, the metric is 1 if everything is fine [14:59:12] godog: I want it to be critical if it's > 1, but no warning if it is == 1 [15:00:48] ema: ah, then I think setting method => 'ge' and warning = 2 critical = 2 will DTRT [15:00:58] PROBLEM - Varnish frontend child restarted on cp1078 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish 
https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1078&var-datasource=eqiad+prometheus/ops [15:01:08] (03CR) 10CDanis: [C: 03+2] fastnetmon: remove UDP and ICMP limits [puppet] - 10https://gerrit.wikimedia.org/r/562387 (https://phabricator.wikimedia.org/T241374) (owner: 10CDanis) [15:01:36] (03PS8) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) [15:01:36] (03PS1) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [15:02:49] (03PS1) 10Ema: varnish: fix varnish_mgt_child_start prometheus query [puppet] - 10https://gerrit.wikimedia.org/r/563188 [15:02:54] godog: something like the above? ^ [15:03:27] PROBLEM - Varnish frontend child restarted on cp2019 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2019&var-datasource=codfw+prometheus/ops [15:03:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1078', diff saved to https://phabricator.wikimedia.org/P10107 and previous config saved to /var/cache/conftool/dbconfig/20200109-150333-marostegui.json [15:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:27] (03CR) 10Filippo Giunchedi: [C: 03+1] varnish: fix varnish_mgt_child_start prometheus query [puppet] - 10https://gerrit.wikimedia.org/r/563188 (owner: 10Ema) [15:05:30] ema: yeah! 
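For reference, the threshold behaviour agreed on above (method 'ge', warning = 2, critical = 2, with the metric normally at 1) can be modelled with a small sketch. This is a toy model only, not the actual check_prometheus_metric code: the function name `classify`, its signature, and every method name other than 'ge' are assumptions made for illustration.

```python
import operator

# Comparison methods keyed by short names. Only "ge" appears in the log;
# the other entries are assumed here for illustration.
METHODS = {
    "ge": operator.ge,
    "gt": operator.gt,
    "le": operator.le,
    "lt": operator.lt,
}


def classify(value, warning, critical, method="ge"):
    """Return 'CRITICAL', 'WARNING' or 'OK' for a metric value (hypothetical helper)."""
    compare = METHODS[method]
    # Critical is checked first, so it wins when both thresholds match.
    if compare(value, critical):
        return "CRITICAL"
    if compare(value, warning):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Healthy host: the metric sits at 1, below both thresholds.
    print(classify(1, warning=2, critical=2))
    # Host whose child restarted: the metric is above the thresholds.
    print(classify(3, warning=2, critical=2))
```

With warning and critical both set to 2, a value of 1 stays OK and any value of 2 or more goes straight to critical, so under this model the warning state is never reached.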
[15:05:33] cheers [15:05:40] (03CR) 10Ema: [C: 03+2] varnish: fix varnish_mgt_child_start prometheus query [puppet] - 10https://gerrit.wikimedia.org/r/563188 (owner: 10Ema) [15:06:03] PROBLEM - Varnish frontend child restarted on cp4029 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4029&var-datasource=ulsfo+prometheus/ops [15:06:23] PROBLEM - Varnish frontend child restarted on cp2016 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2016&var-datasource=codfw+prometheus/ops [15:06:25] PROBLEM - Varnish frontend child restarted on cp5001 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp5001&var-datasource=eqsin+prometheus/ops [15:06:51] (03PS11) 10Jbond: ldap - idp: add ldap helper script for enabling u2f on cas [puppet] - 10https://gerrit.wikimedia.org/r/562852 [15:07:44] (03CR) 10Jbond: "Thanks all fixed" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/562852 (owner: 10Jbond) [15:08:31] oof sigh [15:08:34] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [15:09:03] (03PS1) 10Jhedden: ceph: add wmcs prometheus scrape config [puppet] - 10https://gerrit.wikimedia.org/r/563190 (https://phabricator.wikimedia.org/T240715) [15:09:16] RECOVERY - Varnish frontend child restarted on cp2016 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2016&var-datasource=codfw+prometheus/ops [15:10:18] PROBLEM - Varnish frontend child restarted on cp4024 is 
CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4024&var-datasource=ulsfo+prometheus/ops [15:11:39] 10Operations, 10netops, 10Patch-For-Review: fastnetmon misreports attack type and protocol - https://phabricator.wikimedia.org/T241374 (10CDanis) 05Open→03Stalled Believe this has been worked around for now. [15:11:40] PROBLEM - Varnish frontend child restarted on cp1076 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1076&var-datasource=eqiad+prometheus/ops [15:11:43] 10Operations, 10DBA: backup2001 crashed 2019-12-08 - https://phabricator.wikimedia.org/T240177 (10Papaul) 05Open→03Resolved Before ` BIOS Version 1.3.7 iDRAC Firmware Version 3.34.34.34 ` After ` BIOS Version 2.4.8 iDRAC Firmware Version 4.00.00.00 FW upgrade complete. We can resolve this task and re-open... 
[15:12:13] FWIW I think in a recurrence of the incident from yesterday we'd be adding more IRC spam with the per-host alerts [15:13:28] PROBLEM - Varnish frontend child restarted on cp4030 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4030&var-datasource=ulsfo+prometheus/ops [15:13:42] (03PS2) 10Jhedden: ceph: add wmcs prometheus scrape config [puppet] - 10https://gerrit.wikimedia.org/r/563190 (https://phabricator.wikimedia.org/T240715) [15:13:46] PROBLEM - Varnish frontend child restarted on cp4023 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4023&var-datasource=ulsfo+prometheus/ops [15:14:14] PROBLEM - Varnish frontend child restarted on cp2020 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2020&var-datasource=codfw+prometheus/ops [15:14:14] PROBLEM - Varnish frontend child restarted on cp2007 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2007&var-datasource=codfw+prometheus/ops [15:14:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1078', diff saved to https://phabricator.wikimedia.org/P10108 and previous config saved to /var/cache/conftool/dbconfig/20200109-151434-marostegui.json [15:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:16] 10Operations, 10Wikimedia-Mailing-lists: Please create a private mailing list: sectrainings - https://phabricator.wikimedia.org/T242343 (10ssingh) [15:15:25] RECOVERY - Varnish frontend child restarted on cp1078 is OK: (C)2 ge (W)2 ge 1 
https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1078&var-datasource=eqiad+prometheus/ops [15:15:35] PROBLEM - Varnish frontend child restarted on cp1082 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1082&var-datasource=eqiad+prometheus/ops [15:15:41] PROBLEM - Varnish frontend child restarted on cp4032 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4032&var-datasource=ulsfo+prometheus/ops [15:15:43] RECOVERY - Varnish frontend child restarted on cp4029 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4029&var-datasource=ulsfo+prometheus/ops [15:15:53] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10jijiki) [15:16:08] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10jijiki) [15:16:09] RECOVERY - Varnish frontend child restarted on cp2019 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2019&var-datasource=codfw+prometheus/ops [15:17:54] RECOVERY - Varnish frontend child restarted on cp4032 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4032&var-datasource=ulsfo+prometheus/ops [15:18:08] RECOVERY - Varnish frontend child 
restarted on cp5001 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp5001&var-datasource=eqsin+prometheus/ops [15:18:12] RECOVERY - Varnish frontend child restarted on cp4024 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4024&var-datasource=ulsfo+prometheus/ops [15:18:45] godog: ? [15:18:58] we had crashes on 4 hosts yesterday [15:19:40] RECOVERY - Varnish frontend child restarted on cp2006 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2006&var-datasource=codfw+prometheus/ops [15:21:32] RECOVERY - Varnish frontend child restarted on cp1076 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1076&var-datasource=eqiad+prometheus/ops [15:21:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1078', diff saved to https://phabricator.wikimedia.org/P10109 and previous config saved to /var/cache/conftool/dbconfig/20200109-152157-marostegui.json [15:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:19] (03PS3) 10Jhedden: ceph: add wmcs prometheus scrape config [puppet] - 10https://gerrit.wikimedia.org/r/563190 (https://phabricator.wikimedia.org/T240715) [15:22:37] yeah in the middle of a whole lot of other alerts in that case, and icinga already quit for excess flood IIRC, anyways better critical than warning for sure [15:24:02] PROBLEM - Varnish frontend child restarted on cp5009 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish 
https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp5009&var-datasource=eqsin+prometheus/ops [15:25:14] PROBLEM - Varnish frontend child restarted on cp3060 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3060&var-datasource=esams+prometheus/ops [15:25:32] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/562852 (owner: 10Jbond) [15:25:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1078', diff saved to https://phabricator.wikimedia.org/P10110 and previous config saved to /var/cache/conftool/dbconfig/20200109-152545-marostegui.json [15:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:59] (03CR) 10Muehlenhoff: "Updated PCC, there were a few followup commits to apt:package_from_component: https://puppet-compiler.wmflabs.org/compiler1003/20295/" [puppet] - 10https://gerrit.wikimedia.org/r/561855 (owner: 10Muehlenhoff) [15:31:00] (03PS4) 10Muehlenhoff: grafana: Switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/561855 [15:31:42] PROBLEM - Varnish frontend child restarted on cp5010 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp5010&var-datasource=eqsin+prometheus/ops [15:33:00] RECOVERY - Varnish frontend child restarted on cp3059 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3059&var-datasource=esams+prometheus/ops [15:33:23] (03CR) 10CDanis: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/561855 (owner: 10Muehlenhoff) [15:34:10] PROBLEM - Varnish frontend child restarted on cp2026 is 
CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2026&var-datasource=codfw+prometheus/ops [15:36:21] RECOVERY - Varnish frontend child restarted on cp4031 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4031&var-datasource=ulsfo+prometheus/ops [15:36:47] RECOVERY - Varnish frontend child restarted on cp4030 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4030&var-datasource=ulsfo+prometheus/ops [15:37:03] RECOVERY - Varnish frontend child restarted on cp2026 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2026&var-datasource=codfw+prometheus/ops [15:37:37] RECOVERY - Varnish frontend child restarted on cp1082 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1082&var-datasource=eqiad+prometheus/ops [15:37:37] RECOVERY - Varnish frontend child restarted on cp2020 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2020&var-datasource=codfw+prometheus/ops [15:37:37] RECOVERY - Varnish frontend child restarted on cp2007 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2007&var-datasource=codfw+prometheus/ops [15:37:37] RECOVERY - Varnish frontend child restarted on cp3060 is OK: (C)2 ge (W)2 ge 1 
https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3060&var-datasource=esams+prometheus/ops [15:37:37] RECOVERY - Varnish frontend child restarted on cp4023 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4023&var-datasource=ulsfo+prometheus/ops [15:37:39] RECOVERY - Varnish frontend child restarted on cp5010 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp5010&var-datasource=eqsin+prometheus/ops [15:37:39] RECOVERY - Varnish frontend child restarted on cp5009 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp5009&var-datasource=eqsin+prometheus/ops [15:38:11] alright that should be it. Sorry for the spam! [15:38:19] RECOVERY - Varnish frontend child restarted on cp2014 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2014&var-datasource=codfw+prometheus/ops [15:38:52] 10Operations, 10Puppet, 10Patch-For-Review: puppet-merge can't accept an explicit SHA1 for an --ops merge - https://phabricator.wikimedia.org/T241277 (10CDanis) A simple option: if puppet-merge.sh is given a treeish, it *only* does the ops repo or the labsprivate repo (depending on what flag was passed). A... 
[15:40:31] PROBLEM - Varnish frontend child restarted on cp3058 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3058&var-datasource=esams+prometheus/ops [15:40:39] 👀 [15:40:50] ah yes, these are the ones from yesterday that are now critical [15:41:10] so the check finally works as intended :) [15:42:15] (03CR) 10Bstorm: kubernetes: persist the cpu and mem args in service manifests (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563003 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [15:42:39] (03CR) 10Ppchelko: "> Patch Set 1: Code-Review+2" [puppet] - 10https://gerrit.wikimedia.org/r/562845 (https://phabricator.wikimedia.org/T241756) (owner: 10Clarakosi) [15:44:24] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to EventLogging data for knissen - https://phabricator.wikimedia.org/T241838 (10kai.nissen) @Dzahn Alright, thanks! @Nuria No, I've never used it before, but I'll make myself familiar with it. There are also quite some people who can... 
[15:46:43] (03PS4) 10Jhedden: ceph: add wmcs prometheus scrape config [puppet] - 10https://gerrit.wikimedia.org/r/563190 (https://phabricator.wikimedia.org/T240715) [15:46:59] !log cp3058: varnish-frontend-restart to clear things up after child crash yesterday [15:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:17] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10jijiki) [15:50:03] RECOVERY - Varnish frontend child restarted on cp3058 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3058&var-datasource=esams+prometheus/ops [15:51:07] (03CR) 10Vgutierrez: [C: 03+2] install_server,ncredir: Install ncredir3[00]12 [puppet] - 10https://gerrit.wikimedia.org/r/563161 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [15:51:33] (03PS2) 10Vgutierrez: install_server,ncredir: Install ncredir3[00]12 [puppet] - 10https://gerrit.wikimedia.org/r/563161 (https://phabricator.wikimedia.org/T242321) [15:53:35] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10jijiki) [15:53:45] PROBLEM - Varnish frontend child restarted on cp3062 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3062&var-datasource=esams+prometheus/ops [15:56:37] (03PS32) 10Cwhite: lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [15:56:40] (03CR) 10Jhedden: [C: 03+2] "PCC results https://puppet-compiler.wmflabs.org/compiler1002/20297/" [puppet] - 
10https://gerrit.wikimedia.org/r/563190 (https://phabricator.wikimedia.org/T240715) (owner: 10Jhedden) [16:04:56] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Event-Platform, and 5 others: Public EventGate endpoint for analytics event intake - https://phabricator.wikimedia.org/T233629 (10Ottomata) [16:14:55] (03PS3) 10Dzahn: admins: add Dave Pifke to perf-team admins [puppet] - 10https://gerrit.wikimedia.org/r/562947 (https://phabricator.wikimedia.org/T242189) [16:16:21] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:16:28] (03CR) 10Dzahn: [C: 03+2] "thanks John, rebased and merging" [puppet] - 10https://gerrit.wikimedia.org/r/562947 (https://phabricator.wikimedia.org/T242189) (owner: 10Dzahn) [16:17:18] (03PS4) 10Dzahn: admins: add Dave Pifke to perf-team admins [puppet] - 10https://gerrit.wikimedia.org/r/562947 (https://phabricator.wikimedia.org/T242189) [16:17:19] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:18:56] (03CR) 10Elukey: [C: 04-1] DHCP: Add MAC address entries for mc-gp200[1-3] (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/563184 (https://phabricator.wikimedia.org/T239249) (owner: 10Papaul) [16:19:29] (03CR) 10Muehlenhoff: [C: 03+2] grafana: Switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/561855 (owner: 10Muehlenhoff) [16:22:19] 10Operations, 10Wikimedia-Mailing-lists, 10Release-Engineering-Team-TODO (201911), 10User-zeljkofilipin: Close QA mailing list - https://phabricator.wikimedia.org/T237383 (10Jdforrester-WMF) 05Open→03Resolved https://lists.wikimedia.org/mailman/listinfo/qa It's closed, but not deleted. 
[16:23:03] (03PS4) 10CDanis: puppet-merge.py: SHA1 or explicit FETCH_HEAD is mandatory [puppet] - 10https://gerrit.wikimedia.org/r/559944 (https://phabricator.wikimedia.org/T241277) [16:24:40] (03CR) 10CDanis: "thanks! ptal" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/559944 (https://phabricator.wikimedia.org/T241277) (owner: 10CDanis) [16:25:55] (03CR) 10CDanis: [C: 03+1] "lgtm, feel free to use theemin if you need to do any testing ;)" [puppet] - 10https://gerrit.wikimedia.org/r/561852 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [16:26:40] (03CR) 10CDanis: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/562698 (owner: 10Ayounsi) [16:26:48] (03PS1) 10Effie Mouzeli: mtail: Remove hhvm handler [puppet] - 10https://gerrit.wikimedia.org/r/563206 [16:27:34] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production servers in perf-team group for dpifke - https://phabricator.wikimedia.org/T242189 (10Dzahn) 05Open→03Resolved Hi @dpifke @Gilles, Dave, you have been added to the "perf-team" group. I ran puppet on `webperf1001.eqia... 
[16:28:49] (03CR) 10jerkins-bot: [V: 04-1] mtail: Remove hhvm handler [puppet] - 10https://gerrit.wikimedia.org/r/563206 (owner: 10Effie Mouzeli) [16:29:06] (03PS1) 10Vgutierrez: Add ncredir-lb.esams.wikimedia.org DNS records [dns] - 10https://gerrit.wikimedia.org/r/563207 (https://phabricator.wikimedia.org/T242321) [16:32:12] (03PS1) 10Muehlenhoff: tor: switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563208 [16:33:00] (03CR) 10jerkins-bot: [V: 04-1] tor: switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563208 (owner: 10Muehlenhoff) [16:34:36] (03PS2) 10Muehlenhoff: tor: switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563208 [16:35:35] (03CR) 10jerkins-bot: [V: 04-1] tor: switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563208 (owner: 10Muehlenhoff) [16:36:22] (03PS3) 10Muehlenhoff: tor: switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563208 [16:37:05] (03PS1) 10Vgutierrez: lvs: Set realserver_ips on ncredir esams instances [puppet] - 10https://gerrit.wikimedia.org/r/563210 (https://phabricator.wikimedia.org/T242321) [16:39:23] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Event-Platform, and 5 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) p:05Normal→03High [16:39:28] (03PS1) 10Ottomata: [WIP] New eventgate-analytics-external instance using remote EventStreamConfig API [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) [16:46:28] (03PS1) 10Vgutierrez: lvs: Add ncredir esams configuration [puppet] - 10https://gerrit.wikimedia.org/r/563214 (https://phabricator.wikimedia.org/T242321) [16:55:26] 10Operations, 10Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: LDF server has 404 errors for JS and CSS resources - 
https://phabricator.wikimedia.org/T237165 (10Mstyles) for clarification the correct response will contain a list that looks like this ` @prefix schema: . @... [16:57:28] (03CR) 10BBlack: [C: 03+1] Add ncredir-lb.esams.wikimedia.org DNS records [dns] - 10https://gerrit.wikimedia.org/r/563207 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [16:57:30] (03CR) 10BBlack: [C: 03+1] lvs: Set realserver_ips on ncredir esams instances [puppet] - 10https://gerrit.wikimedia.org/r/563210 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [16:58:42] (03CR) 10Vgutierrez: [C: 03+2] Add ncredir-lb.esams.wikimedia.org DNS records [dns] - 10https://gerrit.wikimedia.org/r/563207 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [17:00:04] godog and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200109T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:30] (03CR) 10Vgutierrez: [C: 03+2] lvs: Set realserver_ips on ncredir esams instances [puppet] - 10https://gerrit.wikimedia.org/r/563210 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [17:01:13] <_joe_> vgutierrez: why aren't you using profile::lvs::realserver? 
[17:01:31] <_joe_> that automates everything for you once you declare which pools you want [17:01:49] ncredir predates that [17:01:52] <_joe_> even adds safe service restart scripts (that interact with etcd to depool/pool the service) [17:01:57] I have to update the puppetization :) [17:05:53] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:05:59] uh [17:06:14] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:06:17] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:06:27] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:07:44] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10wiki_willy) a:05Joe→03Jclark-ctr [17:08:00] ^ no announcements in maint-announce or calendar [17:08:57] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:09:08] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10wiki_willy) a:05Joe→03Jclark-ctr [17:09:21] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:09:33] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:09:39] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=esams 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:11:09] PROBLEM - Varnish frontend child restarted on cp3054 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3054&var-datasource=esams+prometheus/ops [17:11:15] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:28:46] (03CR) 10Alexandros Kosiaris: Support canary functionality (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/469662 (owner: 10Alexandros Kosiaris) [17:30:42] mutante: the links involved in the alerts are the ones under review for https://phabricator.wikimedia.org/T240659, there seems to be a problem with BFD and knams-connected links [17:31:39] elukey: thanks. so it's the Level3 Wave Link between esams and eqiad. also see _security [17:32:22] mutante: those are GTT links no? [17:32:47] (03PS5) 10Alexandros Kosiaris: mathoid: Support canary functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/469662 [17:32:52] L3 joins cr2-esams with cr2-eqiad, they were not involved [17:32:56] elukey: hmm.. i just saw " Transport: cr2-esams:xe-0/1/3 (Level3, BDFS2448, 84ms) {#2013} [10Gbps wave]; " [17:33:05] then they are different [17:33:30] ah I was checking the very last ones, didn't see that one [17:33:34] was it before? 
[17:34:14] elukey: ah..yea, they are alerts from about 1h 20m ago [17:34:37] still going on though [17:34:43] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr2-eqiad&service=Router+interfaces [17:35:07] lovely [17:36:23] vgutierrez: ^ so one part was probably https://phabricator.wikimedia.org/T240659 but the other part is unrelated [17:37:19] mutante: there is an email from L3 now in maintenance@ [17:37:43] so they are aware of it [17:37:53] !log confctl set/weight=10 for elastic10[53-67] - T242348 [17:37:54] (no need to follow up, I was about to send them an email) [17:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:56] T242348: Investigate low resource usage on elastic1061-67 - https://phabricator.wikimedia.org/T242348 [17:38:00] !log volans@cumin1001 conftool action : set/weight=10; selector: name=elastic105[3-9].eqiad.wmnet [17:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:03] elukey: ah, i had just checked that a few minutes ago. they caught up. 
thanks [17:38:16] (03PS6) 10Alexandros Kosiaris: mathoid: Support canary functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/469662 [17:38:18] (03PS1) 10Alexandros Kosiaris: mathoid: Remove old externalIP config value [deployment-charts] - 10https://gerrit.wikimedia.org/r/563228 [17:38:34] mutante: yes just arrived, all good :) [17:38:52] !log volans@cumin1001 conftool action : set/weight=10; selector: name=elastic106.*.eqiad.wmnet [17:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: Remove old externalIP config value [deployment-charts] - 10https://gerrit.wikimedia.org/r/563228 (owner: 10Alexandros Kosiaris) [17:40:03] (03Merged) 10jenkins-bot: mathoid: Remove old externalIP config value [deployment-charts] - 10https://gerrit.wikimedia.org/r/563228 (owner: 10Alexandros Kosiaris) [17:52:50] (03PS2) 10Papaul: DHCP: Add MAC address entries for mc-gp200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/563184 (https://phabricator.wikimedia.org/T239249) [17:59:06] (03PS1) 10Ottomata: Use new primary schema repo for eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/563231 (https://phabricator.wikimedia.org/T240985) [17:59:33] (03CR) 10Ottomata: [C: 03+2] Use new primary schema repo for eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/563231 (https://phabricator.wikimedia.org/T240985) (owner: 10Ottomata) [18:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200109T1800). [18:01:01] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . 
[18:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:47] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [18:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:15] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [18:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:41] (03PS7) 10Alexandros Kosiaris: mathoid: Support canary functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/469662 [18:55:47] !log otto@deploy1001 Started deploy [analytics/hdfs-tools/deploy@f8e9d6f]: (no justification provided) [18:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:56] !log otto@deploy1001 Finished deploy [analytics/hdfs-tools/deploy@f8e9d6f]: (no justification provided) (duration: 00m 08s) [18:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200109T1900). [19:00:05] Jdlrobson: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:37] \o [19:04:51] I can SWAT today! [19:05:41] Jdlrobson: are you able to test your patches? [19:06:33] Urbanecm: yup [19:06:38] (03CR) 10Urbanecm: [C: 03+2] Drop beta setting. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562607 (https://phabricator.wikimedia.org/T237290) (owner: 10Jdlrobson) [19:07:39] (03Merged) 10jenkins-bot: Drop beta setting. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/562607 (https://phabricator.wikimedia.org/T237290) (owner: 10Jdlrobson) [19:08:08] Jdlrobson: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/563226 is the master patch, could you please do the cherrypicks? [19:08:59] Jdlrobson: please test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/562607 at mwdebug1001 and let me know [19:09:37] (03CR) 10Elukey: [C: 03+2] DHCP: Add MAC address entries for mc-gp200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/563184 (https://phabricator.wikimedia.org/T239249) (owner: 10Papaul) [19:09:43] (on it) [19:09:46] thank you [19:10:02] sorry https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/MobileFrontend/+/563230 is the one you want [19:10:18] thanks [19:10:37] and config change is working as expected [19:10:42] confirmed on mwdebug1001 [19:10:42] syncing [19:12:20] Jdlrobson: waiting on CI now, I'll ping you when it's ready [19:13:04] !log urbanecm@deploy1001 Synchronized wmf-config/mobile.php: SWAT: 2f9ee90: Drop beta setting (T237290) (duration: 01m 06s) [19:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:08] T237290: Disable mobile beta mode (for now) - https://phabricator.wikimedia.org/T237290 [19:13:35] papaul: o/ mc-gp dhcp change merged and puppet just ran on install1002 and install2002 [19:23:33] merged! yay [19:24:22] (03CR) 10Jforrester: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562607 (https://phabricator.wikimedia.org/T237290) (owner: 10Jdlrobson) [19:25:29] Jdlrobson: mobilefrontend patch is available at mwdebug1001 [19:25:37] let me know if I should add the core patch as well [19:25:54] (I mean, if you test both at once, or one by one - I see there's one task for them) [19:26:52] (03CR) 10Jdlrobson: "> Do I need to make this a C-2 next time?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/562607 (https://phabricator.wikimedia.org/T237290) (owner: 10Jdlrobson) [19:27:19] elukey: thanks [19:27:52] Urbanecm: core patch can go out at the same time or just after [19:28:19] Okay [19:28:26] let me know what you think about the MF patch first [19:29:34] Urbanecm: that change looks good to me and can be synced [19:29:36] thanks! [19:29:38] doing [19:30:07] Jdlrobson: is the patch order-sensitive? [19:30:11] (I mean, order of the files [19:32:42] (03CR) 10Jforrester: "> Patch Set 4:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562607 (https://phabricator.wikimedia.org/T237290) (owner: 10Jdlrobson) [19:33:41] Urbanecm: not really [19:33:47] Okay, thanks [19:33:52] Urbanecm: ideally it would be Mf first then core [19:33:54] but it doesn't matter [19:34:10] I meant, file order within the MF patch. [19:34:33] syncing normally, hopefully it'll work as expected [19:35:56] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.14/extensions/MobileFrontend/: SWAT: 31d3be7: Hot fixes for mobile diff page (T242310) (duration: 01m 09s) [19:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:59] T242310: Regression: issues with MobileDiff - https://phabricator.wikimedia.org/T242310 [19:36:05] Jdlrobson: here you are [19:36:39] thank you! [19:36:53] Jdlrobson: core is at mwdebug1001 [19:40:51] 10Operations, 10ops-codfw: (Need By: Jan 15) codfw: rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10Papaul) [19:42:33] (03PS1) 10ZPapierski: admin: added user zpapierski [puppet] - 10https://gerrit.wikimedia.org/r/563261 (https://phabricator.wikimedia.org/T242341) [19:42:39] Jdlrobson: any issues? 
[19:42:53] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 2 others: LDF server has 404 errors for JS and CSS resources - https://phabricator.wikimedia.org/T237165 (10Gehel) a:03Mstyles [19:44:16] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Requesting access to Search Platform Service for Zbyszko Papierski - https://phabricator.wikimedia.org/T242341 (10Gehel) [19:45:15] Jdlrobson: ? [19:49:02] nope Urbanecm looks good! [19:49:08] thank you! [19:49:10] thanks! [19:50:32] 10Operations, 10ops-codfw: (Need By: Jan 15) codfw: rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10Papaul) [19:51:00] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.14/resources/Resources.php: SWAT: 39bc331: Enable mediawiki.page.patrol.ajax on mobile (T242310) (duration: 01m 05s) [19:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:04] T242310: Regression: issues with MobileDiff - https://phabricator.wikimedia.org/T242310 [19:51:08] Jdlrobson: all done now! [19:51:18] !log Morning SWAT done [19:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] longma and liw: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200109T2000). 
[20:03:37] (03PS1) 10Jeena Huneidi: all wikis to 1.35.0-wmf.14 refs T233862 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/563263 [20:03:40] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.35.0-wmf.14 refs T233862 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/563263 (owner: 10Jeena Huneidi) [20:04:35] (03PS2) 10MSantos: WIP: Proton charts first draft [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) [20:04:38] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.14 refs T233862 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/563263 (owner: 10Jeena Huneidi) [20:06:17] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.14 refs T233862 [20:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:20] T233862: 1.35.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T233862 [20:06:42] 10Operations, 10ops-codfw: (Need By: Jan 15) codfw: rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10Papaul) [20:08:43] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 83233056 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:10:31] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:16:37] (03CR) 10CDanis: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/561816 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [20:18:28] (03PS5) 10CDanis: puppet-merge.py: SHA1 or explicit FETCH_HEAD is mandatory [puppet] - 10https://gerrit.wikimedia.org/r/559944 (https://phabricator.wikimedia.org/T241277) [20:32:01] !log add phabtest2 to #security temp to ensure reporting settings (T240605) [20:32:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:05] T240605: Set Security Issue Task Type as default for Security reporting - https://phabricator.wikimedia.org/T240605 [20:35:41] 10Operations, 10ops-codfw: (Need By: Jan 15) codfw: rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10Papaul) [20:36:08] 10Operations, 10ops-codfw: (Need By: Jan 15) codfw: rack/setup/install mc-gp200[123].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10Papaul) a:05Papaul→03elukey @elukey all yours [20:43:02] 10Operations, 10observability: Make status.wikimedia.org a "status entry page"? - https://phabricator.wikimedia.org/T242367 (10Gestumblindi) [20:56:45] (03CR) 10Jbond: [C: 03+1] "lgtm but god this script is getting ugly, glad git blame wont just blame me ;)" [puppet] - 10https://gerrit.wikimedia.org/r/559944 (https://phabricator.wikimedia.org/T241277) (owner: 10CDanis) [20:57:45] (03CR) 10CDanis: [C: 04-1] stunnel: add stunnel module and update rsync to use it (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/558133 (owner: 10Jbond) [21:07:49] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 86278328 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:11:09] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:16:27] (03CR) 10Cwhite: [C: 03+2] lvs, prometheus, profile: add blackbox job helper and enable openapi scrapes [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [21:24:32] (03PS4) 10Jbond: etcd: remove username/password [puppet] - 10https://gerrit.wikimedia.org/r/561819 (https://phabricator.wikimedia.org/T240941) [21:24:40] !log ebernhardson@deploy1001 Started deploy [search/airflow@746c149]: Add skein to 
airflow venv [21:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:34] !log ebernhardson@deploy1001 Finished deploy [search/airflow@746c149]: Add skein to airflow venv (duration: 00m 55s) [21:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:30:03] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:31:29] 10Operations, 10Research, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (10Dzahn) Hi @Aroraakhil thank you. Yes, there is no rush from our side. You can try anytime. If needed you can click "Add Action" -> Ch... 
[21:31:47] 10Operations, 10Traffic: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10BBlack) p:05Triage→03Normal [21:32:00] 10Operations, 10Traffic: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10BBlack) [21:32:02] 10Operations, 10DNS, 10Research, 10Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (10BBlack) [21:32:10] 10Operations, 10Traffic: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10BBlack) [21:32:12] 10Operations, 10DNS, 10Research, 10Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (10BBlack) 05Open→03Resolved [21:34:34] (03PS1) 10EBernhardson: airflow: Properly pass quoted cli arguments in wrapper [puppet] - 10https://gerrit.wikimedia.org/r/563280 [21:34:53] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Requesting access to Search Platform Service for Zbyszko Papierski - https://phabricator.wikimedia.org/T242341 (10Dzahn) >>! In T242341#5789679, @Gehel wrote: > Not sure who can check that the appropriate NDA are si... 
[21:37:47] (03CR) 10Jbond: stunnel: add stunnel module and update rsync to use it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/558133 (owner: 10Jbond) [21:38:31] (03PS1) 10BBlack: Fix WDQS LDF URI routing [puppet] - 10https://gerrit.wikimedia.org/r/563281 (https://phabricator.wikimedia.org/T237165) [21:41:29] (03CR) 10BBlack: [C: 03+2] Fix WDQS LDF URI routing [puppet] - 10https://gerrit.wikimedia.org/r/563281 (https://phabricator.wikimedia.org/T237165) (owner: 10BBlack) [21:49:00] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1100 - https://phabricator.wikimedia.org/T241506 (10Jclark-ctr) @Marostegui Drive has arrived. Please PM me on IRC so we can get this swapped [21:52:15] (03PS1) 10Cwhite: lvs, monitoring, prometheus: bugfix openapi exports [puppet] - 10https://gerrit.wikimedia.org/r/563283 (https://phabricator.wikimedia.org/T205870) [21:53:24] (03PS2) 10Cwhite: lvs, monitoring, prometheus: bugfix openapi exports [puppet] - 10https://gerrit.wikimedia.org/r/563283 (https://phabricator.wikimedia.org/T205870) [21:54:03] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Requesting access to Search Platform Service for Zbyszko Papierski - https://phabricator.wikimedia.org/T242341 (10RStallman-legalteam) No extra NDA needed for full time WMF staff as the confidentiality agreement is... [21:56:01] (03CR) 10Bstorm: "This seems to have broken some cases of cloud VMs. It clearly has broken puppet on the tools-puppetmaster-01 VM (Jessie)." [puppet] - 10https://gerrit.wikimedia.org/r/562789 (owner: 10Jbond) [21:56:31] (03CR) 10Bstorm: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/562789 (owner: 10Jbond) [21:57:32] (03CR) 10Cwhite: [C: 03+2] lvs, monitoring, prometheus: bugfix openapi exports [puppet] - 10https://gerrit.wikimedia.org/r/563283 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [22:05:41] longma: Train all done? [22:06:03] Yes [22:06:41] Excellent. Congratulations. 
[22:07:46] :) [22:09:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_maps_eqsin,swagger_check_restbase_eqsin} site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:10:08] (03PS1) 10Dzahn: gerrit: make db_user configurable in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/563284 (https://phabricator.wikimedia.org/T239151) [22:12:16] (03CR) 10Paladox: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/563284 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [22:14:07] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/20300/" [puppet] - 10https://gerrit.wikimedia.org/r/563284 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [22:17:01] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 2 others: LDF server has 404 errors for JS and CSS resources - https://phabricator.wikimedia.org/T237165 (10Vahurzpu) It's consistently working for me now. Thanks! [22:17:24] (03CR) 10Dzahn: [C: 03+2] gerrit: make db_user configurable in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/563284 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [22:21:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1016 crash - https://phabricator.wikimedia.org/T241882 (10bd808) [22:21:53] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10bd808) [22:22:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1013: server down for no reason (power issue?) 
- https://phabricator.wikimedia.org/T241313 (10bd808) [22:22:35] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): (No Need By Date Provided) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10bd808) [22:23:16] (03PS2) 10Dzahn: gerrit: make db_user configurable in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/563284 (https://phabricator.wikimedia.org/T239151) [22:25:15] (03CR) 10Bstorm: apt:::pin: allow callers to override the notify resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/562789 (owner: 10Jbond) [22:35:32] 10Operations, 10Research, 10Traffic: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10leila) [22:39:07] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:39:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:41:27] 10Operations, 10Research, 10Traffic: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10leila) @bmansurov you can use this task for tracking and implementing the change for bringing the hosting of wikiworkshop.org to github. (For context: I had a chat wit... [22:42:55] (03CR) 10Bstorm: "Found it on other VMs with apt::pins." [puppet] - 10https://gerrit.wikimedia.org/r/562789 (owner: 10Jbond) [22:43:47] (03PS1) 10Bstorm: Revert "apt:::pin: allow callers to override the notify resource" [puppet] - 10https://gerrit.wikimedia.org/r/563296 [22:44:26] 10Operations, 10Research, 10Traffic: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10Reedy) >>! 
In T242374#5791357, @leila wrote: > @bmansurov you can use this task for tracking and implementing the change for bringing the hosting of wikiworkshop.org to... [22:48:27] (03CR) 10Bstorm: "So far all broken puppet agent VMs are running puppet 4.8.2 (stretch and jessie). They always break with multiple definitions of resource" [puppet] - 10https://gerrit.wikimedia.org/r/563296 (owner: 10Bstorm) [22:51:46] (03CR) 10Jbond: [C: 03+1] "lgtm please create ticket showing the error which caused this revert" [puppet] - 10https://gerrit.wikimedia.org/r/563296 (owner: 10Bstorm) [22:57:20] (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/563296 (owner: 10Bstorm) [22:57:38] (03PS2) 10Bstorm: Revert "apt:::pin: allow callers to override the notify resource" [puppet] - 10https://gerrit.wikimedia.org/r/563296 [23:00:42] (03CR) 10Bstorm: [C: 03+2] Revert "apt:::pin: allow callers to override the notify resource" [puppet] - 10https://gerrit.wikimedia.org/r/563296 (owner: 10Bstorm) [23:01:59] (03CR) 10Dzahn: [C: 03+2] hieradata/labs: url_downloader settings for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/563016 (owner: 10Dzahn) [23:02:09] (03PS3) 10Dzahn: hieradata/labs: url_downloader settings for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/563016 [23:13:35] bstorm_: thanks, you fixed puppet that i needed for unrelated stuff :) [23:13:53] y/w 😁 [23:14:10] (03PS1) 10Cwhite: monitoring, profile, prometheus: bugfix, prometheus params values [puppet] - 10https://gerrit.wikimedia.org/r/563301 (https://phabricator.wikimedia.org/T205870) [23:15:35] (03PS1) 10Dzahn: gerrit: use 'gerritro' readonly db user on test server [puppet] - 10https://gerrit.wikimedia.org/r/563302 (https://phabricator.wikimedia.org/T239151) [23:16:27] (03CR) 10jerkins-bot: [V: 04-1] monitoring, profile, prometheus: bugfix, prometheus params values [puppet] - 10https://gerrit.wikimedia.org/r/563301 (https://phabricator.wikimedia.org/T205870) 
(owner: 10Cwhite) [23:19:26] (03CR) 10Dzahn: "ah, =thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/562965 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [23:20:30] (03PS2) 10Dzahn: gerrit: use 'gerritro' readonly db user on test server [puppet] - 10https://gerrit.wikimedia.org/r/563302 (https://phabricator.wikimedia.org/T239151) [23:20:42] (03PS3) 10Dzahn: gerrit: use 'gerritro' readonly db user on test server [puppet] - 10https://gerrit.wikimedia.org/r/563302 (https://phabricator.wikimedia.org/T239151) [23:20:52] (03PS2) 10Cwhite: monitoring, profile, prometheus: bugfix, prometheus params values [puppet] - 10https://gerrit.wikimedia.org/r/563301 (https://phabricator.wikimedia.org/T205870) [23:21:11] (03PS4) 10Dzahn: gerrit: use 'gerritro' readonly db user on test server [puppet] - 10https://gerrit.wikimedia.org/r/563302 (https://phabricator.wikimedia.org/T239151) [23:22:31] (03PS3) 10Cwhite: monitoring, profile, prometheus: bugfix, prometheus params values [puppet] - 10https://gerrit.wikimedia.org/r/563301 (https://phabricator.wikimedia.org/T205870) [23:23:22] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10Dzahn) [23:25:40] (03CR) 10Cwhite: [C: 03+2] monitoring, profile, prometheus: bugfix, prometheus params values [puppet] - 10https://gerrit.wikimedia.org/r/563301 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [23:28:28] (03PS1) 10Cwhite: Revert "monitoring, profile, prometheus: bugfix, prometheus params values" [puppet] - 10https://gerrit.wikimedia.org/r/563304 [23:29:45] (03CR) 10Cwhite: [C: 03+2] Revert "monitoring, profile, prometheus: bugfix, prometheus params values" [puppet] - 10https://gerrit.wikimedia.org/r/563304 (owner: 10Cwhite) [23:31:15] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty 
Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10Dzahn) [23:32:04] (03CR) 10Paladox: [C: 03+1] gerrit: use 'gerritro' readonly db user on test server [puppet] - 10https://gerrit.wikimedia.org/r/563302 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [23:32:57] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10Dzahn) [23:35:48] (03CR) 10Dzahn: [C: 03+2] gerrit: use 'gerritro' readonly db user on test server [puppet] - 10https://gerrit.wikimedia.org/r/563302 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [23:35:58] (03PS5) 10Dzahn: gerrit: use 'gerritro' readonly db user on test server [puppet] - 10https://gerrit.wikimedia.org/r/563302 (https://phabricator.wikimedia.org/T239151) [23:37:46] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10serviceops-radar, 10Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (10Dzahn) [23:38:34] (03PS1) 10Cwhite: lvs, monitoring: prometheus expects string[] type as value of params [puppet] - 10https://gerrit.wikimedia.org/r/563306 (https://phabricator.wikimedia.org/T205870) [23:40:49] (03PS2) 10Dzahn: site: remove phab1003, decom [puppet] - 10https://gerrit.wikimedia.org/r/563020 (https://phabricator.wikimedia.org/T238957) [23:40:57] (03PS2) 10Cwhite: lvs, monitoring: prometheus expects params value as string[] type [puppet] - 10https://gerrit.wikimedia.org/r/563306 (https://phabricator.wikimedia.org/T205870) [23:42:11] (03CR) 10Cwhite: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/20302/" [puppet] - 10https://gerrit.wikimedia.org/r/563306 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [23:42:19] (03CR) 10Dzahn: [C: 03+2] site: remove phab1003, decom [puppet] - 10https://gerrit.wikimedia.org/r/563020 
(https://phabricator.wikimedia.org/T238957) (owner: 10Dzahn) [23:43:54] (03CR) 10Cwhite: [C: 03+2] lvs, monitoring: prometheus expects params value as string[] type [puppet] - 10https://gerrit.wikimedia.org/r/563306 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [23:44:28] (03CR) 10Dzahn: "apt::package_from_component currently still has an issue in cloud afaict" [puppet] - 10https://gerrit.wikimedia.org/r/563208 (owner: 10Muehlenhoff) [23:49:07] ACKNOWLEDGEMENT - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_maps_eqsin,swagger_check_maps_esams,swagger_check_maps_ulsfo,swagger_check_restbase_eqsin,swagger_check_restbase_esams,swagger_check_restbase_ulsfo} site={eqsin,esams,ulsfo} cole_white working on rolling out new scraping rules https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org [23:49:07] etheus-targets [23:59:11] 10Puppet: Peculiar puppet agent error for apt::pin change - https://phabricator.wikimedia.org/T242383 (10Reedy)