[00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191212T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:15:21] (CR) Krinkle: [C: -1] "Indeed. mwdebug* are not debug proxies, they are debug app servers." [puppet] - https://gerrit.wikimedia.org/r/556302 (https://phabricator.wikimedia.org/T214734) (owner: Gergő Tisza)
[00:17:20] Operations, Release-Engineering-Team, serviceops, Patch-For-Review, and 3 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (Krinkle) @jijiki Can we depool mwdebug1002 again so that it can...
[00:29:00] Operations, DNS, Research, Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (leila) >>! In T240303#5734123, @Krinkle wrote: > This question isn't directly related but might help indirectly clear some confusion: > > Who will pay for the domain name...
[00:30:35] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[00:50:23] (Abandoned) Gergő Tisza: Add MOTD to mwdebug1002 warning about T214734 [puppet] - https://gerrit.wikimedia.org/r/556302 (https://phabricator.wikimedia.org/T214734) (owner: Gergő Tisza)
[01:00:05] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191212T0100).
[01:15:03] (Abandoned) Bstorm: toolforge-k8s: reduce the default terminated-pod-gc-threshold [puppet] - https://gerrit.wikimedia.org/r/555627 (https://phabricator.wikimedia.org/T240009) (owner: Bstorm)
[01:41:52] !log volker-e@deploy1001 Started deploy [design/style-guide@481eaf6]: Deploy design/style-guide:
[01:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:42:00] !log volker-e@deploy1001 Finished deploy [design/style-guide@481eaf6]: Deploy design/style-guide: (duration: 00m 07s)
[01:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:42:28] Operations, Traffic: Browser Connection Security warning page apparently produces invalid XML - https://phabricator.wikimedia.org/T240497 (Reedy)
[01:46:13] Operations, Traffic: Browser Connection Security warning page apparently produces invalid XML - https://phabricator.wikimedia.org/T240497 (Reedy) It might be invalid XML, but https://validator.w3.org/ has no problem with it {F31473167} And https://developer.mozilla.org/en-US/docs/Web/HTML/Element/img f...
[03:21:12] (CR) Andrew Bogott: [C: +1] "that regex is painful but lgtm :)" [puppet] - https://gerrit.wikimedia.org/r/556495 (https://phabricator.wikimedia.org/T239918) (owner: Jhedden)
[03:21:47] (PS3) Andrew Bogott: Openstack: move eqiad1 to version 'ocata' [puppet] - https://gerrit.wikimedia.org/r/554829 (https://phabricator.wikimedia.org/T237749)
[03:21:49] (PS1) Andrew Bogott: Horizon: put in maintenance mode for the newton=>ocata upgrade [puppet] - https://gerrit.wikimedia.org/r/556515 (https://phabricator.wikimedia.org/T237749)
[03:21:51] (PS1) Andrew Bogott: Revert "Horizon: put in maintenance mode for the newton=>ocata upgrade" [puppet] - https://gerrit.wikimedia.org/r/556516
[03:31:10] (PS2) Andrew Bogott: Revert "Horizon: put in maintenance mode for the newton=>ocata upgrade" [puppet] - https://gerrit.wikimedia.org/r/556516 (https://phabricator.wikimedia.org/T237749)
[04:20:55] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:20:02] Operations, Traffic: Browser Connection Security warning page apparently produces invalid XML - https://phabricator.wikimedia.org/T240497 (DavidBrooks) Oh, goodness, my thinking is muddled today. Blame a cold. This is an API query (GET /w/api.php?action=query...). It expects well-formed XML with some le...
[05:47:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1097:3314 after schema change T233135', diff saved to https://phabricator.wikimedia.org/P9861 and previous config saved to /var/cache/conftool/dbconfig/20191212-054708-marostegui.json
[05:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:16] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[05:47:52] !log Deploy schema change on db1102:3314
[05:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:13] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[05:56:13] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[05:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:51] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[05:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:59] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:25] (PS1) Marostegui: site.pp: Remove puppet references from db2070 [puppet] - https://gerrit.wikimedia.org/r/556520 (https://phabricator.wikimedia.org/T239684)
[05:57:27] (CR) Andrew Bogott: [C: +2] Horizon: put in maintenance mode for the newton=>ocata upgrade [puppet] - https://gerrit.wikimedia.org/r/556515 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[05:57:33] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[05:57:33] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:41] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[05:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:42] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:18] (PS1) Marostegui: wmnet: Remove production DNS for db2070 [dns] - https://gerrit.wikimedia.org/r/556521 (https://phabricator.wikimedia.org/T239684)
[05:58:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:56] (CR) Marostegui: [C: +2] site.pp: Remove puppet references from db2070 [puppet] - https://gerrit.wikimedia.org/r/556520 (https://phabricator.wikimedia.org/T239684) (owner: Marostegui)
[05:59:38] (CR) Marostegui: [C: +2] wmnet: Remove production DNS for db2070 [dns] - https://gerrit.wikimedia.org/r/556521 (https://phabricator.wikimedia.org/T239684) (owner: Marostegui)
[05:59:56] (CR) Andrew Bogott: [C: +2] Openstack: move eqiad1 to version 'ocata' [puppet] - https://gerrit.wikimedia.org/r/554829 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[06:00:57] Operations, ops-codfw, DC-Ops, decommission: Decommission db2070.codfw.wmnet - https://phabricator.wikimedia.org/T239684 (Marostegui) a:Marostegui→Papaul
[06:01:17] Operations, ops-codfw, DC-Ops, decommission: Decommission db2070.codfw.wmnet - https://phabricator.wikimedia.org/T239684 (Marostegui) This host is ready for @Papaul to finish the last steps
[06:01:31] Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[06:02:05] Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui) Open→Resolved All these hosts have been decommissioned or pending the last on-site decommissioning steps.
[06:02:07] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[06:11:07] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[06:11:09] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:34] Operations, Release-Engineering-Team, Wikimedia-Rdbms, Core Platform Team Workboards (Clinic Duty Team): WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki - https://phabricator.wikimedia.org/T240405 (Marostegui) No more occurrences of this query have...
[06:47:00] !log Upgrade db1117
[06:47:00] marostegui: Failed to log message to wiki. Somebody should check the error logs.
[07:09:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1137 for upgrade (duration: 01m 03s)
[07:09:08] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[07:10:23] !log Upgrade db1137
[07:10:23] marostegui: Failed to log message to wiki. Somebody should check the error logs.
[07:10:31] mmm, I guess it needs restart
[07:13:22] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01065 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[07:20:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1137 after upgrade (duration: 01m 02s)
[07:25:52] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.003551 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[07:31:22] PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:31:54] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:45:50] PROBLEM - BFD status on cr1-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:47:36] RECOVERY - BFD status on cr1-eqsin is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:47:38] RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:51:02] Operations: investigate making 'notrack' the default on our ferm rules - https://phabricator.wikimedia.org/T240495 (akosiaris) There is the (possibly very very minor) upside of the current situation that the very first rule in INPUT is 1018219 3110598857 ACCEPT all -- * * 0.0.0.0/0...
[07:52:25] Warning Alert for device cr2-eqsin.wikimedia.org - Traffic on tunnel link
[07:59:09] Operations: investigate making 'notrack' the default on our ferm rules - https://phabricator.wikimedia.org/T240495 (MoritzMuehlenhoff) >>! In T240495#5734771, @akosiaris wrote: > Just to understand this better, is this just about making $notrack default to true in https://gerrit.wikimedia.org/r/plugins/gitil...
[07:59:12] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[08:02:25] W̶a̶r̶n̶i̶n̶g Device cr2-eqsin.wikimedia.org recovered from Traffic on tunnel link
[08:07:23] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[08:29:47] (CR) Andrew Bogott: [C: +2] Revert "Horizon: put in maintenance mode for the newton=>ocata upgrade" [puppet] - https://gerrit.wikimedia.org/r/556516 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[08:30:36] (CR) Elukey: [C: +2] cdh::hadoop: remove ipv6 constraints [puppet] - https://gerrit.wikimedia.org/r/556337 (https://phabricator.wikimedia.org/T240255) (owner: Elukey)
[08:34:41] !log cleanup puppetmaster1001:/run/confd-template
[08:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:48] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kibana on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd
[08:35:52] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kibana on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd
[08:36:38] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[08:42:54] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[08:42:56] PROBLEM - MegaRAID on db1123 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:42:57] ACKNOWLEDGEMENT - MegaRAID on db1123 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T240534 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:42:59] Operations, ops-eqiad: Degraded RAID on db1123 - https://phabricator.wikimedia.org/T240534 (ops-monitoring-bot)
[08:43:03] (CR) Muehlenhoff: "Looks good, couple of nits inline" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/556343 (https://phabricator.wikimedia.org/T237608) (owner: Ema)
[08:44:28] Operations, ops-eqiad, DBA: Degraded RAID on db1123 - https://phabricator.wikimedia.org/T240534 (Marostegui) p:Triage→High This is 3 primary master, can we make sure we get this replaced before the holidays break?
[08:45:15] Operations, ops-eqiad, DBA: Degraded RAID on db1123 - https://phabricator.wikimedia.org/T240534 (Marostegui) a:wiki_willy Assigning to @wiki_willy to make sure it is under #dc-ops radar
[08:46:16] (PS1) Elukey: cdh::hadoop: set a lens in augeas - follow up of previous commit [puppet] - https://gerrit.wikimedia.org/r/556533
[08:48:13] (CR) Elukey: [C: +2] cdh::hadoop: set a lens in augeas - follow up of previous commit [puppet] - https://gerrit.wikimedia.org/r/556533 (owner: Elukey)
[08:51:53] (CR) Muehlenhoff: [C: +1] "Looks good (depending on the filepath mentioned in the other review). Could you please also open a task for Observability folks to make th" [puppet] - https://gerrit.wikimedia.org/r/556345 (https://phabricator.wikimedia.org/T237608) (owner: Ema)
[08:57:40] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6625 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[08:59:26] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.04167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[09:02:06] Operations, DNS, Research, Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (jcrespo) @BBlack Are Leila's answers covering all your questions or do you need additional information to propose a way to move forward?
[09:04:31] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:12:23] Operations, Release-Engineering-Team, Wikimedia-Rdbms, Core Platform Team Workboards (Clinic Duty Team): WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki - https://phabricator.wikimedia.org/T240405 (jcrespo) Note the issue will only happen on a very...
[09:12:25] Operations, Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (TheDJ) BTW. We no longer have the cipher stats grafana board ? Too bad, that one was hella interesting.
[09:12:52] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:16:12] (PS1) Filippo Giunchedi: prometheus: add alerts for eventgate-logging-external latency/errors [puppet] - https://gerrit.wikimedia.org/r/556631 (https://phabricator.wikimedia.org/T226986)
[09:24:15] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:30:49] Operations, Traffic: Setup a new PKI software as an alternative to the puppet CA for managing services certificates - https://phabricator.wikimedia.org/T194031 (Joe) a:Joe→Volans
[09:33:08] (PS1) Elukey: cdh::hadoop: replace augeas with a file resource [puppet] - https://gerrit.wikimedia.org/r/556633 (https://phabricator.wikimedia.org/T240255)
[09:34:01] (PS2) Gehel: maps: enable replication cron [puppet] - https://gerrit.wikimedia.org/r/556525 (https://phabricator.wikimedia.org/T239728) (owner: Mathew.onipe)
[09:34:50] (CR) Gehel: [C: +2] maps: enable replication cron [puppet] - https://gerrit.wikimedia.org/r/556525 (https://phabricator.wikimedia.org/T239728) (owner: Mathew.onipe)
[09:36:18] (CR) Elukey: [C: +2] cdh::hadoop: replace augeas with a file resource [puppet] - https://gerrit.wikimedia.org/r/556633 (https://phabricator.wikimedia.org/T240255) (owner: Elukey)
[09:37:20] !log Retroactive: deploy schema change on db1102:3314
[09:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:42] !log upgrading recently reimaged stretch hosts back to puppet 5 / facter 3 T239832
[09:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:51] T239832: Fix installation of Puppet 5/Facter 3 on new stretch installs/reimages - https://phabricator.wikimedia.org/T239832
[09:56:36] (CR) Filippo Giunchedi: "> Patch Set 26:" [puppet] - https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[10:12:09] Operations, Research, SRE-Access-Requests: Google Search Console access request -- Isaac - https://phabricator.wikimedia.org/T240501 (jcrespo) a:jcrespo
[10:12:27] Operations, Research, SRE-Access-Requests: Google Search Console access request -- Isaac - https://phabricator.wikimedia.org/T240501 (jcrespo) p:Triage→Normal
[10:23:43] jouncebot: no
[10:23:44] jouncebot: now
[10:23:44] No deployments scheduled for the next 1 hour(s) and 36 minute(s)
[10:23:47] jouncebot: next
[10:23:47] In 1 hour(s) and 36 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191212T1200)
[10:26:23] (PS1) Andrew Bogott: cloud-vps nfs: mkdir -p our mount points [puppet] - https://gerrit.wikimedia.org/r/556637
[10:29:46] Operations, ops-eqiad, netops: Circuit down between cr1-eqiad and cr1-codfw - https://phabricator.wikimedia.org/T240545 (elukey) p:Triage→High
[10:31:57] (PS2) Ema: systemd: add icinga check for journal patterns [puppet] - https://gerrit.wikimedia.org/r/556343 (https://phabricator.wikimedia.org/T237608)
[10:34:06] (PS1) Tpt: Enables the Wikisource extension on all Wikisources [mediawiki-config] - https://gerrit.wikimedia.org/r/556639 (https://phabricator.wikimedia.org/T240546)
[10:36:20] (CR) Ema: systemd: add icinga check for journal patterns (4 comments) [puppet] - https://gerrit.wikimedia.org/r/556343 (https://phabricator.wikimedia.org/T237608) (owner: Ema)
[10:43:36] (PS2) Ema: ATS: add icinga check for logs skipped by trafficserver{,-tls} [puppet] - https://gerrit.wikimedia.org/r/556345 (https://phabricator.wikimedia.org/T237608)
[10:43:55] (CR) Andrew Bogott: [C: +2] cloud-vps nfs: mkdir -p our mount points [puppet] - https://gerrit.wikimedia.org/r/556637 (owner: Andrew Bogott)
[10:45:13] (CR) Muehlenhoff: [C: +1] "Looks good to me, one nit inline, but feel free to ignore." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/556343 (https://phabricator.wikimedia.org/T237608) (owner: Ema)
[10:48:49] (PS3) Ema: systemd: add icinga check for journal patterns [puppet] - https://gerrit.wikimedia.org/r/556343 (https://phabricator.wikimedia.org/T237608)
[10:48:50] (PS3) Ema: ATS: add icinga check for logs skipped by trafficserver{,-tls} [puppet] - https://gerrit.wikimedia.org/r/556345 (https://phabricator.wikimedia.org/T237608)
[10:49:53] (CR) Ema: systemd: add icinga check for journal patterns (1 comment) [puppet] - https://gerrit.wikimedia.org/r/556343 (https://phabricator.wikimedia.org/T237608) (owner: Ema)
[10:50:32] (PS1) Elukey: hadoop: remove ipv6 constraint workaround [puppet] - https://gerrit.wikimedia.org/r/556641 (https://phabricator.wikimedia.org/T240255)
[10:51:03] (CR) Ema: [C: +2] systemd: add icinga check for journal patterns [puppet] - https://gerrit.wikimedia.org/r/556343 (https://phabricator.wikimedia.org/T237608) (owner: Ema)
[10:52:40] (PS2) Elukey: hadoop: remove ipv6 constraint workaround [puppet] - https://gerrit.wikimedia.org/r/556641 (https://phabricator.wikimedia.org/T240255)
[10:55:17] (CR) Elukey: [C: +2] hadoop: remove ipv6 constraint workaround [puppet] - https://gerrit.wikimedia.org/r/556641 (https://phabricator.wikimedia.org/T240255) (owner: Elukey)
[11:00:18] PROBLEM - DPKG on mw2272 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:01:18] PROBLEM - DPKG on mw2264 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:01:33] ok this is spreading
[11:02:14] it's just some icinga noise, I'm rolling out the puppet 5 packages which got lost by reimages
[11:02:21] T239832
[11:02:22] T239832: Fix installation of Puppet 5/Facter 3 on new stretch installs/reimages - https://phabricator.wikimedia.org/T239832
[11:02:28] should recover shortly
[11:03:56] RECOVERY - DPKG on mw2272 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:04:54] RECOVERY - DPKG on mw2264 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:07:11] Hi, unfortunately I will not be able to make to the start of today's EU SWAT and I will be about 20 mins late. Sorry!
[11:08:12] PROBLEM - DPKG on cp1077 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:08:50] ok moritzm tz
[11:08:51] tx
[11:11:48] RECOVERY - DPKG on cp1077 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:12:43] jouncebot: now
[11:12:43] No deployments scheduled for the next 0 hour(s) and 47 minute(s)
[11:14:56] PROBLEM - DPKG on cp3055 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:15:11] Operations, Puppet, Patch-For-Review, User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (jbond)
[11:16:00] PROBLEM - DPKG on cp5012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:16:44] RECOVERY - DPKG on cp3055 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:17:13] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/Wikibase: BACKPORTS: wikibase tainted refs https://gerrit.wikimedia.org/r/#/q/topic:backports-wd-tainted-1 (duration: 01m 08s)
[11:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:50] RECOVERY - DPKG on cp5012 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[11:30:21] (PS1) Jbond: etcd::client::globalconfig: add types to signiture [puppet] - https://gerrit.wikimedia.org/r/556646
[11:30:23] (PS1) Jbond: etcd::client::globalconfig: add ca_cert [puppet] - https://gerrit.wikimedia.org/r/556647 (https://phabricator.wikimedia.org/T237362)
[11:31:00] Operations, Puppet, serviceops, Patch-For-Review, User-jbond: Rolling restart of etcd to pick up the renewed CA public certificate. - https://phabricator.wikimedia.org/T237362 (jbond) Have been trying to document some of the [[ https://wikitech.wikimedia.org/wiki/User:Jbond/Encryption#Conf_to...
[11:35:58] (PS1) Hashar: ci: remove zuul-cloner package from Jenkins agents [puppet] - https://gerrit.wikimedia.org/r/556648 (https://phabricator.wikimedia.org/T240551)
[11:36:45] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/556648 (https://phabricator.wikimedia.org/T240551) (owner: Hashar)
[11:37:14] jouncebot: next
[11:37:14] In 0 hour(s) and 22 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191212T1200)
[11:38:07] (PS2) Hashar: ci: remove zuul-cloner package from Jenkins agents [puppet] - https://gerrit.wikimedia.org/r/556648 (https://phabricator.wikimedia.org/T240551)
[11:38:25] Operations, observability, serviceops: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (jijiki)
[11:38:28] Operations, observability, serviceops: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (jijiki)
[11:38:57] (PS2) Jbond: etcd::client::globalconfig: add types to signiture [puppet] - https://gerrit.wikimedia.org/r/556646
[11:41:27] !log Removing zuul package from Jessie CI instances # T240551
[11:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:33] T240551: Remove Zuul Debian package from WMCS instances - https://phabricator.wikimedia.org/T240551
[11:42:15] (PS3) Jbond: etcd::client::globalconfig: add types to signiture [puppet] - https://gerrit.wikimedia.org/r/556646
[11:42:40] (PS1) Muehlenhoff: Decom puppetdb1001/2001 [puppet] - https://gerrit.wikimedia.org/r/556650
[11:43:40] (PS4) Jbond: etcd::client::globalconfig: add types to signiture [puppet] - https://gerrit.wikimedia.org/r/556646
[11:43:54] (PS2) Jbond: etcd::client::globalconfig: add ca_cert [puppet] - https://gerrit.wikimedia.org/r/556647 (https://phabricator.wikimedia.org/T237362)
[11:45:41] (PS3) Jbond: etcd::client::globalconfig: add ca_cert [puppet] - https://gerrit.wikimedia.org/r/556647 (https://phabricator.wikimedia.org/T237362)
[11:45:50] !log jmm@cumin2001 START - Cookbook sre.hosts.decommission
[11:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:41] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[11:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:46] Puppet, Patch-For-Review, User-jbond: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `puppetdb2001.codfw.wmnet` - puppetdb2001.codfw.wmnet (**FAIL**) - Downtimed hos...
[11:46:47] Puppet, Patch-For-Review, User-jbond: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `puppetdb2001.codfw.wmnet` - puppetdb2001.codfw.wmnet (**FAIL**) - Downtimed hos...
[11:49:05] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/556650 (owner: Muehlenhoff)
[11:49:07] !log removing puppetdb2001 from Ganeti
[11:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:54] (PS2) Muehlenhoff: Decom puppetdb1001/2001 [puppet] - https://gerrit.wikimedia.org/r/556650
[11:59:38] (CR) Muehlenhoff: [C: +2] Decom puppetdb1001/2001 [puppet] - https://gerrit.wikimedia.org/r/556650 (owner: Muehlenhoff)
[12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191212T1200).
[12:00:05] tassu: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:10] Hi. Can you wait about 20 mins for me please?
[12:00:35] tassu: yes
[12:00:52] o/
[12:04:56] Operations: investigate making 'notrack' the default on our ferm rules - https://phabricator.wikimedia.org/T240495 (BBlack) Yes, it's about that `$notrack` default. My hypothesis is that setting it to true wouldn't break any traffic, wouldn't change the security situation much, but would eliminate a bunch o...
[12:04:57] Operations: investigate making 'notrack' the default on our ferm rules - https://phabricator.wikimedia.org/T240495 (BBlack) Yes, it's about that `$notrack` default. My hypothesis is that setting it to true wouldn't break any traffic, wouldn't change the security situation much, but would eliminate a bunch o...
[12:16:45] (CR) Arturo Borrero Gonzalez: [C: +1] nginx-ingress: Have ingress pods request realistic resting resources [puppet] - https://gerrit.wikimedia.org/r/556404 (https://phabricator.wikimedia.org/T239405) (owner: Bstorm)
[12:19:18] tassu: are you ready?
[12:19:30] yep
[12:19:34] apologies for the delay
[12:20:04] (CR) Urbanecm: [C: +2] Enable SandboxLink extension on hywwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/556422 (https://phabricator.wikimedia.org/T239387) (owner: Majavah)
[12:20:08] np
[12:21:03] (Merged) jenkins-bot: Enable SandboxLink extension on hywwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/556422 (https://phabricator.wikimedia.org/T239387) (owner: Majavah)
[12:32:12] tassu: could you please test your patch at mwdebug1001, please?
[12:32:20] sure
[12:32:48] (PS1) Muehlenhoff: Extend cleanup list after upgrade with "git-core" [puppet] - https://gerrit.wikimedia.org/r/556657
[12:32:50] it's working
[12:33:07] thank you tassu
[12:33:47] syncing
[12:34:47] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 1c58f09: Enable SandboxLink extension on hywwiki (T239387) (duration: 01m 03s)
[12:34:51] tassu: done!
[12:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:53] T239387: Enable SandboxLink extension on hywwiki - https://phabricator.wikimedia.org/T239387
[12:35:02] great, thanks
[12:35:08] can you approve the GCI task as well?
[12:35:26] sure
[12:35:44] done
[12:36:00] thank you
[12:36:06] (PS3) Urbanecm: Add 2020: Wikimania namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/556234 (https://phabricator.wikimedia.org/T240339) (owner: Ammarpad)
[12:36:07] yw
[12:36:13] (CR) Urbanecm: [C: +2] Add 2020: Wikimania namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/556234 (https://phabricator.wikimedia.org/T240339) (owner: Ammarpad)
[12:37:13] (Merged) jenkins-bot: Add 2020: Wikimania namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/556234 (https://phabricator.wikimedia.org/T240339) (owner: Ammarpad)
[12:37:14] !log installing NSS security updates on buster
[12:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:47] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 07652a6: Add 2020: Wikimania namespace (T240339) (duration: 01m 02s)
[12:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:52] T240339: Create Wikimania 2020 namespace - https://phabricator.wikimedia.org/T240339
[12:39:25] !log EU SWAT done
[12:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:00] hi, I could use a puppet merge for CI instances please :) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/556648/
[12:41:04] doing some cleanup \o/
[12:42:33] (PS1) Muehlenhoff: Add Cumin alias for IDPs [puppet] - https://gerrit.wikimedia.org/r/556658
[12:42:38] (CR) Pmiazga: "in repo I'd like to enable that for base" [mediawiki-config] - https://gerrit.wikimedia.org/r/556440 (https://phabricator.wikimedia.org/T232652) (owner: Pmiazga)
[12:42:44] (PS1) Jbond: cas: merge upstream 6.1 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/556659
[12:44:45] (CR) Muehlenhoff: [C: +2] Add Cumin alias for IDPs [puppet] - https://gerrit.wikimedia.org/r/556658 (owner: Muehlenhoff)
[12:44:50] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/556658 (owner: Muehlenhoff)
[12:47:57] (CR) Muehlenhoff: [C: +1] "LGTM" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/556659 (owner: Jbond)
[12:48:38] (PS1) MSantos: Disable populate_admin script [puppet] - https://gerrit.wikimedia.org/r/556661 (https://phabricator.wikimedia.org/T240227)
[12:49:55] (PS2) MSantos: Disable populate_admin script [puppet] - https://gerrit.wikimedia.org/r/556661 (https://phabricator.wikimedia.org/T240227)
[12:52:10] (CR) Jbond: [V: +2 C: +2] cas: merge upstream 6.1 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/556659 (owner: Jbond)
[12:55:02] Operations, Traffic: Browser Connection Security warning page apparently produces invalid XML - https://phabricator.wikimedia.org/T240497 (Reedy) It kinda is and it isn't. You've been served a static HTML error page that isn't served by the MW API, it's coming from the caches infront of MediaWiki. It doe...
[12:55:04] Operations, Traffic: Browser Connection Security warning page apparently produces invalid XML - https://phabricator.wikimedia.org/T240497 (Reedy) It kinda is and it isn't. You've been served a static HTML error page that isn't served by the MW API, it's coming from the caches infront of MediaWiki. It doe...
[12:55:23] (03CR) 10Mathew.onipe: [C: 03+1] Disable populate_admin script [puppet] - 10https://gerrit.wikimedia.org/r/556661 (https://phabricator.wikimedia.org/T240227) (owner: 10MSantos) [12:58:35] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers [12:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:54] 10Operations, 10Traffic: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML) - https://phabricator.wikimedia.org/T240497 (10Aklapper) [12:58:57] 10Operations, 10Traffic: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML) - https://phabricator.wikimedia.org/T240497 (10Aklapper) [13:08:39] (03PS1) 10Jbond: apereo_cas: update build logic [puppet] - 10https://gerrit.wikimedia.org/r/556664 [13:09:33] (03CR) 10Jbond: "PCC https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/19944" [puppet] - 10https://gerrit.wikimedia.org/r/556664 (owner: 10Jbond) [13:10:15] (03CR) 10Jbond: [C: 03+2] apereo_cas: update build logic [puppet] - 10https://gerrit.wikimedia.org/r/556664 (owner: 10Jbond) [13:12:10] (03CR) 10Masumrezarock100: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556639 (https://phabricator.wikimedia.org/T240546) (owner: 10Tpt) [13:12:54] 10Operations, 10Traffic: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML) - https://phabricator.wikimedia.org/T240497 (10Reedy) Noting this is an output of {T238038} [13:13:05] (03PS1) 10Jbond: noop commit to test puppet build [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/556665 [13:13:27] (03CR) 10Jbond: [V: 03+2 C: 03+2] noop commit to test puppet build [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/556665 (owner: 10Jbond) [13:14:13] 10Operations, 10Traffic: Start warning and deprecation process for 
all legacy TLS - https://phabricator.wikimedia.org/T238038 (10Reedy) >>! In T238038#5727398, @TheDJ wrote: > Question. https://wikitech.wikimedia.org/wiki/HTTPS/Browser_Recommendations > > Windows 7: I know it CAN support TLS 1.2, but I can't... [13:19:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:20:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:21:22] (03CR) 10Jbond: [C: 03+2] ci: remove zuul-cloner package from Jenkins agents [puppet] - 10https://gerrit.wikimedia.org/r/556648 (https://phabricator.wikimedia.org/T240551) (owner: 10Hashar) [13:28:17] (03CR) 10Ema: [C: 03+2] ATS: add icinga check for logs skipped by trafficserver{,-tls} [puppet] - 10https://gerrit.wikimedia.org/r/556345 (https://phabricator.wikimedia.org/T237608) (owner: 10Ema) [13:29:45] 10Operations, 10Traffic: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML) - https://phabricator.wikimedia.org/T240497 (10BBlack) The way it works is that if the connection isn't using TLSv1.2, the user is served a 302 redirect to `/sec-warning` o... [13:34:26] 10Operations, 10Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (10BBlack) >>! In T238038#5734955, @TheDJ wrote: > BTW. We no longer have the cipher stats grafana board ? Too bad, that one was hella interesting. The old cipher stats graphs (the ori... 
[13:38:51] !log contint1001 / contint2001 : upgraded Zuul to 2.5.1-wmf11 # T203846 [13:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:57] T203846: Zuul cancels all changes when a change is manually merged - https://phabricator.wikimedia.org/T203846 [13:39:06] 10Operations, 10ops-eqiad, 10netops: Circuit down between cr1-eqiad and cr1-codfw - https://phabricator.wikimedia.org/T240545 (10Jclark-ctr) a:03Jclark-ctr [13:42:06] 10Operations, 10Continuous-Integration-Infrastructure, 10Gerrit, 10Release-Engineering-Team (CI & Testing services), and 2 others: Upload zuul_2.5.1-wmf11 to apt.wikimedia.org - https://phabricator.wikimedia.org/T240570 (10hashar) [13:42:09] 10Operations, 10Traffic: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML) - https://phabricator.wikimedia.org/T240497 (10Aklapper) Thanks for the detailed explanation. Does that mean the external parser should check the header to realize the mime... [13:42:22] 10Operations, 10Continuous-Integration-Infrastructure, 10Gerrit, 10Release-Engineering-Team (CI & Testing services), and 2 others: Upload zuul_2.5.1-wmf11 to apt.wikimedia.org - https://phabricator.wikimedia.org/T240570 (10hashar) p:05Triage→03Normal a:05hashar→03None [13:43:51] (03CR) 10Hashar: "recheck debian-glue job is now voting and I have done some changes to the pbuilder configuration." [debs/doxygen] (debian/buster-backports) - 10https://gerrit.wikimedia.org/r/554942 (https://phabricator.wikimedia.org/T239482) (owner: 10Hashar) [13:46:24] 10Operations, 10Traffic: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML) - https://phabricator.wikimedia.org/T240497 (10BBlack) I'm not even sure what the task is asking for, but yeah in general we're not going to make the sec-warning mechanism... 
[13:48:30] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Grant "contint-roots" and "releasers-mediawiki" to user brennen - https://phabricator.wikimedia.org/T240382 (10hashar) `+1` :) [14:01:52] (03PS3) 10Gehel: Disable populate_admin script [puppet] - 10https://gerrit.wikimedia.org/r/556661 (https://phabricator.wikimedia.org/T240227) (owner: 10MSantos) [14:02:49] !log merge puppet-merge refactor [14:02:51] (03CR) 10Gehel: [C: 03+2] Disable populate_admin script [puppet] - 10https://gerrit.wikimedia.org/r/556661 (https://phabricator.wikimedia.org/T240227) (owner: 10MSantos) [14:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:24] (03CR) 10Jbond: [C: 03+2] puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [14:06:46] (03PS1) 10Jbond: puppet-merge test: test puppet-merge deployed to one puppet master [puppet] - 10https://gerrit.wikimedia.org/r/556673 [14:07:00] (03PS1) 10BBlack: sec-warning: handle non-GET better [puppet] - 10https://gerrit.wikimedia.org/r/556674 (https://phabricator.wikimedia.org/T238038) [14:07:31] (03CR) 10Jbond: [C: 03+2] puppet-merge test: test puppet-merge deployed to one puppet master [puppet] - 10https://gerrit.wikimedia.org/r/556673 (owner: 10Jbond) [14:08:41] !log Upgrade db2085 and db2086 [14:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:01] (03PS1) 10Jbond: "puppet-merge test: test puppet-merge deployed to one puppet master" more testing [puppet] - 10https://gerrit.wikimedia.org/r/556676 [14:09:37] (03CR) 10jerkins-bot: [V: 04-1] "puppet-merge test: test puppet-merge deployed to one puppet master" more testing [puppet] - 10https://gerrit.wikimedia.org/r/556676 (owner: 10Jbond) [14:10:02] (03PS2) 10Jbond: "puppet-merge test: test puppet-merge deployed to one puppet master" [puppet] - 10https://gerrit.wikimedia.org/r/556676 [14:10:42] (03CR) 10Jbond: [C: 
03+2] "puppet-merge test: test puppet-merge deployed to one puppet master" [puppet] - 10https://gerrit.wikimedia.org/r/556676 (owner: 10Jbond) [14:13:45] (03PS1) 10Hashar: contint: remove role ci::slave::saucelabs [puppet] - 10https://gerrit.wikimedia.org/r/556677 (https://phabricator.wikimedia.org/T240575) [14:13:49] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-m [14:14:46] jbond42: if you don't mind, I have another trivial puppet cleanup patch for you https://gerrit.wikimedia.org/r/#/c/556677/ :) [14:14:54] that drops a role that we no longer use! [14:15:51] patches are probably still on hold for a few minutes while puppet-merge is being worked on [14:16:26] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [14:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:46] hashar: in the middle of a change just now, can you ping me in an hour? [14:17:56] yeah sorry ;D [14:18:12] I just noticed you +2ed a change and thought you could get mine in as well [14:18:13] but [14:18:30] yours is the puppet-merge thing, so yeah that is not a good time ;]]] [14:18:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) [14:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:57] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' .
[14:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:32] 10Operations, 10Traffic, 10serviceops: Appservers behind TLS should support chunked Transfer-Encoding - https://phabricator.wikimedia.org/T240576 (10ema) [14:21:41] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [14:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:45] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:23:47] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:24:24] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [14:30:05] !log pool maps1004 osm-import is complete - T239728 [14:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:12] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. 
- https://phabricator.wikimedia.org/T239728 [14:31:02] (03PS1) 10Elukey: dumps::web::fetches::stats: move to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/556681 (https://phabricator.wikimedia.org/T234229) [14:35:27] jouncebot: now [14:35:27] No deployments scheduled for the next 2 hour(s) and 24 minute(s) [14:35:56] (03PS2) 10Elukey: dumps::web::fetches::stats: move to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/556681 (https://phabricator.wikimedia.org/T234229) [14:38:28] (03PS3) 10Elukey: dumps::web::fetches::stats: move to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/556681 (https://phabricator.wikimedia.org/T234229) [14:38:43] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@7dc11d4]: Update mobileapps to 65272a6 [14:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:59] !log depool maps1001 for postgres reinitialization - T239728 [14:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:07] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 [14:44:01] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:44:56] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@7dc11d4]: Update mobileapps to 65272a6 (duration: 06m 12s) [14:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:59] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/19949/" [puppet] - 10https://gerrit.wikimedia.org/r/556681 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [14:46:14] 10Operations, 10Traffic, 10serviceops: Appservers behind TLS should support chunked Transfer-Encoding - https://phabricator.wikimedia.org/T240576 (10Joe) While it should be easy to swap nginx for envoy, we need to also convert `profile::services_proxy` to use envoy at the same time. It should not be impossi... [14:48:17] (03PS1) 10Hashar: contint: remove role ci::slave::browsertests [puppet] - 10https://gerrit.wikimedia.org/r/556695 [14:50:57] (03PS2) 10Hashar: contint: remove role ci::slave::browsertests [puppet] - 10https://gerrit.wikimedia.org/r/556695 (https://phabricator.wikimedia.org/T220035) [14:52:31] (03CR) 10Joal: [C: 04-1] "Wont work because of trailing spaces in source" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/556681 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [14:52:48] (03PS4) 10Elukey: dumps::web::fetches::stats: move to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/556681 (https://phabricator.wikimedia.org/T234229) [14:53:45] (03CR) 10Ema: [C: 03+1] Extend cleanup list after upgrade with "git-core" [puppet] - 10https://gerrit.wikimedia.org/r/556657 (owner: 10Muehlenhoff) [14:55:53] (03CR) 10Elukey: dumps::web::fetches::stats: move to systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556681 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [14:56:29] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.10/includes/specials/SpecialUserrights.php: T240574 (duration: 01m 02s) [14:56:33] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:34] T240574: SpecialUserrights.php: Call to undefined method CentralAuthGroupMembershipProxy::isSystemUser() - https://phabricator.wikimedia.org/T240574 [14:56:37] (03CR) 10Filippo Giunchedi: [C: 03+1] Extend cleanup list after upgrade with "git-core" [puppet] - 10https://gerrit.wikimedia.org/r/556657 (owner: 10Muehlenhoff) [14:58:11] (03CR) 10Ottomata: [C: 03+1] prometheus: add alerts for eventgate-logging-external latency/errors [puppet] - 10https://gerrit.wikimedia.org/r/556631 (https://phabricator.wikimedia.org/T226986) (owner: 10Filippo Giunchedi) [15:01:49] 10Operations, 10Maps, 10Discovery-Search (Current work): Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (10Mathew.onipe) [15:02:47] 10Operations: investigate making 'notrack' the default on our ferm rules - https://phabricator.wikimedia.org/T240495 (10akosiaris) >>! In T240495#5735553, @BBlack wrote: > Yes, it's about that `$notrack` default. My hypothesis is that setting it to true wouldn't break any traffic, wouldn't change the security s... [15:03:09] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:03:19] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:03:20] 10Operations, 10Analytics, 10SRE-Access-Requests: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (10jcrespo) 05Open→03Resolved @ACraze seems to be unavailable. Resolving, but please reopen if you found issues later. 
[15:03:23] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to stats machines/ores hosts hosts for Andy Craze - https://phabricator.wikimedia.org/T226204 (10jcrespo) [15:03:33] 10Operations, 10Analytics, 10SRE-Access-Requests: Add accraze to analytics-privatedata-users - https://phabricator.wikimedia.org/T240243 (10jcrespo) a:05ACraze→03jcrespo [15:04:16] 10Operations, 10observability, 10serviceops: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (10akosiaris) p:05Triage→03High I am gonna triage as high because there is the fear that we are currently losin... [15:06:47] (03CR) 10Ema: [C: 03+1] Public routing from intake-logging.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/556413 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:06:57] (03PS8) 10Muehlenhoff: Add image tracking support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) [15:07:37] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:07:49] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:07:54] this is John working on it --^ [15:11:29] 10Operations, 10observability, 10serviceops: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (10Ottomata) This error message might be a false positive. I think librdkafka spits it out (is rsyslog using librd... 
[15:12:01] (03CR) 10Ottomata: [C: 03+2] Add intake-{logging,analytics}.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/556411 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:12:04] (03PS2) 10Ottomata: Add intake-{logging,analytics}.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/556411 (https://phabricator.wikimedia.org/T236386) [15:13:00] !log deleting puppetdb1001 in Ganeti T228657 [15:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:06] T228657: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 [15:13:59] PROBLEM - Unmerged changes on repository puppet on puppetmaster1003 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:15:35] PROBLEM - Host puppetdb1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:15:53] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:16:09] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:16:11] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:33] (03CR) 10Ottomata: [C: 03+1] cdh::hadoop: replace augeas with a file resource [puppet] - 10https://gerrit.wikimedia.org/r/556633 (https://phabricator.wikimedia.org/T240255) (owner: 10Elukey) [15:17:19] (03CR) 10Ottomata: [C: 03+2] Public routing from intake-logging.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/556413 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:17:26] (03PS2) 10Ottomata: Public routing from 
intake-logging.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/556413 (https://phabricator.wikimedia.org/T236386) [15:18:27] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission [15:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:56] !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [15:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:05] 10Puppet, 10Patch-For-Review, 10User-jbond: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `puppetdb1001.eqiad.wmnet` - puppetdb1001.eqiad.wmnet (**FAIL**) - Downtimed hos... [15:21:02] (03PS1) 10Jbond: puppet-merge: fix some issue observer during deploy [puppet] - 10https://gerrit.wikimedia.org/r/556699 [15:21:18] (03PS2) 10Muehlenhoff: Complete package list for slice pinning on stat/notebook [puppet] - 10https://gerrit.wikimedia.org/r/556213 [15:21:46] (03CR) 10CDanis: [C: 04-2] "> Patch Set 3: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556236 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [15:23:28] (03CR) 10Marostegui: [C: 03+1] "> > Patch Set 3: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556236 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [15:29:28] (03CR) 10Muehlenhoff: [C: 03+2] Complete package list for slice pinning on stat/notebook [puppet] - 10https://gerrit.wikimedia.org/r/556213 (owner: 10Muehlenhoff) [15:30:18] (03PS2) 10Jbond: puppet-merge: fix some issue observer during deploy [puppet] - 10https://gerrit.wikimedia.org/r/556699 [15:30:58] 10Operations, 10Traffic: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML) - https://phabricator.wikimedia.org/T240497 (10DavidBrooks) Thanks, BBlack, for the obvious thoughtful 
care that went into this. And, in my case, it had the desired end-res... [15:31:07] (03CR) 10Volans: "Some comment inline, I don't have the full picture of the script right now in my head to give a real vote, but if needed I can spend the t" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/556699 (owner: 10Jbond) [15:33:22] (03CR) 10CDanis: puppet-merge: fix some issue observer during deploy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556699 (owner: 10Jbond) [15:35:52] (03PS6) 10Volans: Add virtual-chassis support [software/homer] - 10https://gerrit.wikimedia.org/r/550367 (owner: 10Ayounsi) [15:35:54] (03PS1) 10Volans: netbox: split generic and device-specific data [software/homer] - 10https://gerrit.wikimedia.org/r/556703 (https://phabricator.wikimedia.org/T228388) [15:37:59] (03CR) 10Jbond: "> Patch Set 2:" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/556699 (owner: 10Jbond) [15:38:13] (03PS3) 10Jbond: puppet-merge: fix some issue observed during deploy [puppet] - 10https://gerrit.wikimedia.org/r/556699 [15:38:28] (03CR) 10jerkins-bot: [V: 04-1] netbox: split generic and device-specific data [software/homer] - 10https://gerrit.wikimedia.org/r/556703 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [15:38:49] (03CR) 10jerkins-bot: [V: 04-1] Add virtual-chassis support [software/homer] - 10https://gerrit.wikimedia.org/r/550367 (owner: 10Ayounsi) [15:39:53] (03CR) 10CDanis: [C: 03+1] puppet-merge: fix some issue observed during deploy [puppet] - 10https://gerrit.wikimedia.org/r/556699 (owner: 10Jbond) [15:41:39] (03CR) 10Jbond: [C: 03+2] puppet-merge: fix some issue observed during deploy [puppet] - 10https://gerrit.wikimedia.org/r/556699 (owner: 10Jbond) [15:42:10] (03PS2) 10Filippo Giunchedi: prometheus: add alerts for eventgate-logging-external latency/errors [puppet] - 10https://gerrit.wikimedia.org/r/556631 (https://phabricator.wikimedia.org/T226986) [15:45:28] (03PS1) 10Jbond: puppet-merge test: test 
puppet-merge deployed to one puppet master"" [puppet] - 10https://gerrit.wikimedia.org/r/556705 [15:45:30] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet-merge test: test puppet-merge deployed to one puppet master"" [puppet] - 10https://gerrit.wikimedia.org/r/556705 (owner: 10Jbond) [15:45:52] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add alerts for eventgate-logging-external latency/errors [puppet] - 10https://gerrit.wikimedia.org/r/556631 (https://phabricator.wikimedia.org/T226986) (owner: 10Filippo Giunchedi) [15:46:14] godog: still testing puppet-merge a bit you happy for me to merge yours [15:46:25] jbond42: yup! thank you, was about to ping you :) [15:46:31] ack thanks [15:48:03] (03PS1) 10Jbond: Revert "puppet-merge test: test puppet-merge deployed to one puppet master""" [puppet] - 10https://gerrit.wikimedia.org/r/556707 [15:48:09] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "puppet-merge test: test puppet-merge deployed to one puppet master""" [puppet] - 10https://gerrit.wikimedia.org/r/556707 (owner: 10Jbond) [15:49:30] RECOVERY - Unmerged changes on repository puppet on puppetmaster1003 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:51:33] (03PS1) 10Jbond: puppet-merge: store and test the exit code [puppet] - 10https://gerrit.wikimedia.org/r/556708 [15:53:08] (03CR) 10Jbond: [C: 03+2] puppet-merge: store and test the exit code [puppet] - 10https://gerrit.wikimedia.org/r/556708 (owner: 10Jbond) [15:53:53] (03CR) 10Muehlenhoff: [C: 03+2] Extend cleanup list after upgrade with "git-core" [puppet] - 10https://gerrit.wikimedia.org/r/556657 (owner: 10Muehlenhoff) [15:55:41] 10Operations, 10Kubernetes: Migrate etcd cluster for Kubernetes staging cluster to Stretch/Buster - https://phabricator.wikimedia.org/T224568 (10akosiaris) [15:55:44] 10Operations, 10serviceops, 10Kubernetes: Migrate Kubernetes etcd clusters to Stretch/Buster - https://phabricator.wikimedia.org/T224574 (10akosiaris) [15:55:47] 10Operations, 10serviceops, 10Kubernetes: Migrate etcd networking cluster to Stretch/Buster - https://phabricator.wikimedia.org/T224577 (10akosiaris) [15:55:55] 10Operations, 10Research, 10SRE-Access-Requests: Google Search Console access request -- Isaac - https://phabricator.wikimedia.org/T240501 (10jcrespo) a:05jcrespo→03Isaac Isaac: Access with your email account as it appears on the HR tool has been provided (restricted-read only- account), as based on your... 
[15:56:46] (03PS1) 10Jbond: puppet-merge test: test puppet-merge deployed to one puppet master"""" [puppet] - 10https://gerrit.wikimedia.org/r/556711 [15:57:06] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet-merge test: test puppet-merge deployed to one puppet master"""" [puppet] - 10https://gerrit.wikimedia.org/r/556711 (owner: 10Jbond) [15:58:02] jouncebot: now [15:58:02] No deployments scheduled for the next 1 hour(s) and 1 minute(s) [15:58:05] jouncebot: next [15:58:05] In 1 hour(s) and 1 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191212T1700) [15:58:06] (03PS1) 10CDanis: dbctl: update schemata for 1.3.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/556712 (https://phabricator.wikimedia.org/T229686) [15:59:53] !log T229686 ✔️ cdanis@install1002.wikimedia.org ~ 🕚☕ sudo -E reprepro -C main include stretch-wikimedia conftool_1.3.0-1_amd64.changes [15:59:54] (03CR) 10jerkins-bot: [V: 04-1] dbctl: update schemata for 1.3.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/556712 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [15:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:59] T229686: #dbctl: manage 'externalLoads' data - https://phabricator.wikimedia.org/T229686 [16:00:20] (03PS1) 10Alexandros Kosiaris: etcd: Align better etcdv2 and etcdv3 profiles [puppet] - 10https://gerrit.wikimedia.org/r/556713 (https://phabricator.wikimedia.org/T239835) [16:00:44] (03PS2) 10Volans: netbox: split generic and device-specific data [software/homer] - 10https://gerrit.wikimedia.org/r/556703 (https://phabricator.wikimedia.org/T228388) [16:00:46] (03PS7) 10Volans: Add virtual-chassis support [software/homer] - 10https://gerrit.wikimedia.org/r/550367 (owner: 10Ayounsi) [16:01:09] !log sudo -E reprepro -C main include buster-wikimedia conftool_1.3.0-1+deb10u1_amd64.changes [16:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:19] (03CR) 
10Bstorm: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/556681 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [16:02:52] !log T229686 upgrade python3-conftool and python3-conftool-dbctl on cumin hosts [16:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:37] (03PS1) 10Jbond: puppet-merge: pass $1 unchanged [puppet] - 10https://gerrit.wikimedia.org/r/556718 [16:04:48] (03PS1) 10Elukey: role::search::airflow: add analytics cluster users [puppet] - 10https://gerrit.wikimedia.org/r/556719 (https://phabricator.wikimedia.org/T236180) [16:05:00] (03CR) 10Jbond: [C: 03+2] puppet-merge: pass $1 unchanged [puppet] - 10https://gerrit.wikimedia.org/r/556718 (owner: 10Jbond) [16:05:34] (03PS1) 10Jbond: puppet-merge test: test puppet-merge deployed to one puppet master [puppet] - 10https://gerrit.wikimedia.org/r/556720 [16:06:30] (03PS2) 10CDanis: dbctl: update schemata for 1.3.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/556712 (https://phabricator.wikimedia.org/T229686) [16:06:44] (03CR) 10Elukey: [C: 03+2] role::search::airflow: add analytics cluster users [puppet] - 10https://gerrit.wikimedia.org/r/556719 (https://phabricator.wikimedia.org/T236180) (owner: 10Elukey) [16:06:49] (03CR) 10Jbond: [C: 03+2] puppet-merge test: test puppet-merge deployed to one puppet master [puppet] - 10https://gerrit.wikimedia.org/r/556720 (owner: 10Jbond) [16:07:40] elukey: i merged yours hope thats ok [16:07:44] PROBLEM - traffic_server backend process restarted on cp4028 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=ulsfo+prometheus/ops&var-instance=cp4028&var-layer=backend [16:08:18] (03CR) 10jerkins-bot: [V: 04-1] dbctl: update schemata for 1.3.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/556712 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [16:08:44] jbond42: thanks! 
I was staring at an empty puppet-merge and I was wondering if I needed coffee or not :D [16:08:46] 10Operations, 10ops-eqiad, 10netops: Circuit down between cr1-eqiad and cr1-codfw - https://phabricator.wikimedia.org/T240545 (10Jclark-ctr) Replaced failed Fiber [16:08:48] (03CR) 10Phamhi: [C: 03+1] nginx-ingress: Have ingress pods request realistic resting resources [puppet] - 10https://gerrit.wikimedia.org/r/556404 (https://phabricator.wikimedia.org/T239405) (owner: 10Bstorm) [16:08:50] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/556713 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [16:08:58] 10Operations, 10ops-eqiad, 10netops: Circuit down between cr1-eqiad and cr1-codfw - https://phabricator.wikimedia.org/T240545 (10Jclark-ctr) 05Open→03Resolved [16:08:59] :) [16:09:30] (03CR) 10Phamhi: [C: 03+1] ceph: add rbd client support [puppet] - 10https://gerrit.wikimedia.org/r/556488 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [16:09:50] !log installing libvorbis security updates [16:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:21] (03PS1) 10CDanis: taskgen: jsonschema: bugfix [puppet] - 10https://gerrit.wikimedia.org/r/556721 [16:13:06] (03CR) 10CDanis: [C: 03+2] taskgen: jsonschema: bugfix [puppet] - 10https://gerrit.wikimedia.org/r/556721 (owner: 10CDanis) [16:13:17] (03PS1) 10Elukey: profile::analytics::search::airflow: fix the scheduler's syslog id [puppet] - 10https://gerrit.wikimedia.org/r/556722 (https://phabricator.wikimedia.org/T236180) [16:14:25] (03PS3) 10CDanis: dbctl: update schemata for 1.3.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/556712 (https://phabricator.wikimedia.org/T229686) [16:15:40] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff 
https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:15:55] lol yes [16:15:57] expected [16:16:06] I blame Ruby 🙃 [16:16:09] lol [16:16:18] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:16:26] (03CR) 10CDanis: [C: 03+2] dbctl: update schemata for 1.3.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/556712 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [16:16:37] (03CR) 10Elukey: [C: 03+2] profile::analytics::search::airflow: fix the scheduler's syslog id [puppet] - 10https://gerrit.wikimedia.org/r/556722 (https://phabricator.wikimedia.org/T236180) (owner: 10Elukey) [16:17:04] ehm [16:17:05] elukey@puppetmaster1001:~$ sudo -i puppet-merge [16:17:05] /usr/local/bin/puppet-merge: 38: /usr/local/bin/puppet-merge: die: not found [16:17:17] jbond42: ^^^ [16:17:18] jbond42: --^ [16:17:50] probably I was racing with cdanis and getting to die [16:17:56] * elukey blames cdanis [16:18:18] ah no failed everywhere [16:18:27] (03PS3) 10Jhedden: openstack: change cloudvirt1022 to ceph based virt role [puppet] - 10https://gerrit.wikimedia.org/r/556495 (https://phabricator.wikimedia.org/T239918) [16:18:42] with "no changes to merge" [16:20:01] elukey: so the lock detection is not working? 
[16:20:44] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:20:59] jbond42: I guess so, I didn't follow the recent changes to puppet-merge though :( [16:21:19] ack ill look thanks [16:21:32] (03CR) 10Jhedden: [C: 03+2] ceph: add rbd client support [puppet] - 10https://gerrit.wikimedia.org/r/556488 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [16:23:22] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:24:11] jbond42: I have a patch ready to submit, want me to hold off while you look at it? [16:24:31] jeh i may use that for testing if you are happy for me to merge it? [16:24:43] works for me, thanks :) [16:24:49] great thanks [16:26:36] (03PS7) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 [16:26:40] (03PS1) 10Jbond: Revert "puppet-merge test: test puppet-merge deployed to one puppet master" [puppet] - 10https://gerrit.wikimedia.org/r/556723 [16:26:47] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "puppet-merge test: test puppet-merge deployed to one puppet master" [puppet] - 10https://gerrit.wikimedia.org/r/556723 (owner: 10Jbond) [16:27:51] (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (owner: 10Jbond) [16:31:07] (03PS1) 10CDanis: dbctl: add 'PLACEHOLDER' as possible section master [software/conftool] - 10https://gerrit.wikimedia.org/r/556725 [16:34:16] (03PS1) 10Jbond: puppet-merge: fix lock detection [puppet] - 10https://gerrit.wikimedia.org/r/556726 [16:35:52] (03CR) 10Jbond: [C: 03+2] puppet-merge: fix lock detection [puppet] - 10https://gerrit.wikimedia.org/r/556726 (owner: 10Jbond) [16:39:12] 10Operations, 10observability, 10serviceops: rsyslogd: omkafka: action will 
suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (10fgiunchedi) rsyslog does indeed use librdkafka so it might be that! re: losing logs AFAICT that's not happening... [16:39:29] 10Operations, 10Release-Engineering-Team, 10serviceops, 10Performance-Team (Radar), and 2 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (10jijiki) First of all, we found that this fatal error is... [16:39:35] (03PS1) 10Jbond: Revert "Revert "puppet-merge test: test puppet-merge deployed to one puppet master"" [puppet] - 10https://gerrit.wikimedia.org/r/556728 [16:40:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "Revert "puppet-merge test: test puppet-merge deployed to one puppet master"" [puppet] - 10https://gerrit.wikimedia.org/r/556728 (owner: 10Jbond) [16:41:59] (03PS1) 10Jbond: puppet-merge: actully exit [puppet] - 10https://gerrit.wikimedia.org/r/556729 [16:43:27] (03CR) 10Jbond: [C: 03+2] puppet-merge: actully exit [puppet] - 10https://gerrit.wikimedia.org/r/556729 (owner: 10Jbond) [16:45:34] (03PS1) 10Jbond: Revert "Revert "Revert "puppet-merge test: test puppet-merge deployed to one puppet master""" [puppet] - 10https://gerrit.wikimedia.org/r/556731 [16:45:43] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "Revert "Revert "puppet-merge test: test puppet-merge deployed to one puppet master""" [puppet] - 10https://gerrit.wikimedia.org/r/556731 (owner: 10Jbond) [16:45:52] (03PS1) 10Ssingh: Add script for fetching routing information from RIPEstat [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/556732 [16:46:32] (03CR) 10jerkins-bot: [V: 04-1] Add script for fetching routing information from RIPEstat [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/556732 (owner: 10Ssingh) [16:47:48] (03PS1) 10Jbond: puppet-merge: add line feed to print statements [puppet] - 
10https://gerrit.wikimedia.org/r/556733 [16:49:44] (03CR) 10Jbond: [C: 03+2] puppet-merge: add line feed to print statements [puppet] - 10https://gerrit.wikimedia.org/r/556733 (owner: 10Jbond) [16:50:30] elukey: the issue oyu saw is hopefully fixed now let me know if you see anything elses [16:50:59] (03PS2) 10Ssingh: Add script for fetching routing information from RIPEstat [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/556732 [16:51:34] (03CR) 10jerkins-bot: [V: 04-1] Add script for fetching routing information from RIPEstat [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/556732 (owner: 10Ssingh) [16:54:17] 10Puppet, 10Patch-For-Review, 10User-jbond: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (10jbond) 05Open→03Resolved This is now complete all ops productions servers are on puppet 5 [16:54:44] 10Operations, 10Release-Engineering-Team, 10serviceops, 10Performance-Team (Radar), and 2 others: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 (10Joe) It appears to me that we try to send something on... [16:55:10] (03CR) 10ArielGlenn: [C: 03+1] "assuming the kerberos timer syntax stuff is right, here's my thumbs up." 
[puppet] - 10https://gerrit.wikimedia.org/r/556681 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [16:58:04] (03PS4) 10Jhedden: openstack: change cloudvirt1022 to ceph based virt role [puppet] - 10https://gerrit.wikimedia.org/r/556495 (https://phabricator.wikimedia.org/T239918) [16:58:40] (03PS8) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 [16:59:37] (03PS1) 10BBlack: dotls: define acme cert [puppet] - 10https://gerrit.wikimedia.org/r/556738 (https://phabricator.wikimedia.org/T239994) [16:59:39] (03PS1) 10BBlack: [WIP] dotls: main implementation [puppet] - 10https://gerrit.wikimedia.org/r/556739 (https://phabricator.wikimedia.org/T239994) [16:59:51] (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (owner: 10Jbond) [17:00:05] godog and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191212T1700). [17:00:05] tgr: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:01:43] _joe_: what do you think re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/546896 for puppet swat ? [17:01:57] <_joe_> godog: already reviewing [17:02:14] o/ [17:02:27] nice, thank you [17:05:17] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1123 - https://phabricator.wikimedia.org/T240534 (10wiki_willy) a:05wiki_willy→03Jclark-ctr @Jclark-ctr - looks like this one is still under warranty, so you should be able to just RMA it. 
Thanks, Willy [17:05:27] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The patch is still using crons, which we're deprecating in favour of profile::mediawiki::periodic_job which uses systemd timers and can be" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/546896 (https://phabricator.wikimedia.org/T208369) (owner: 10Gergő Tisza) [17:05:37] (03PS1) 10Jforrester: Change wikimaniawiki logo back to general version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556740 (https://phabricator.wikimedia.org/T240578) [17:05:56] <_joe_> tgr: the patch needs a bit of work, but I can work on it with rlazarus tomorrow morning and merge it for you once it's corrected [17:05:58] (03PS2) 10Jforrester: Change wikimaniawiki logo back to general version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556740 (https://phabricator.wikimedia.org/T240578) [17:06:00] (03PS2) 10Volans: netbox: add vlan support [software/homer] - 10https://gerrit.wikimedia.org/r/550375 (owner: 10Ayounsi) [17:06:14] <_joe_> unless this is extremely time-sensitive [17:07:00] <_joe_> we should probably put some documentation header at the top of profile::mediawiki::maintenance [17:07:03] <_joe_> sorry :/ [17:07:43] (03PS9) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 [17:08:51] (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (owner: 10Jbond) [17:10:51] (03PS10) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 [17:12:26] (03PS1) 10Jhedden: openstack: add cluster to virt_ceph hieradata [puppet] - 10https://gerrit.wikimedia.org/r/556742 (https://phabricator.wikimedia.org/T239918) [17:14:19] (03CR) 10Jhedden: [C: 03+2] openstack: add cluster to virt_ceph hieradata [puppet] - 10https://gerrit.wikimedia.org/r/556742 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [17:16:12] (03PS5) 10Jhedden: openstack: change 
cloudvirt1022 to ceph based virt role [puppet] - 10https://gerrit.wikimedia.org/r/556495 (https://phabricator.wikimedia.org/T239918) [17:17:17] (03CR) 10Jhedden: [C: 03+2] openstack: change cloudvirt1022 to ceph based virt role [puppet] - 10https://gerrit.wikimedia.org/r/556495 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [17:17:22] (03CR) 10Jbond: "Hi all" [puppet] - 10https://gerrit.wikimedia.org/r/544943 (owner: 10Jbond) [17:20:58] (03PS1) 10Elukey: role::search::airflow: add kerberos config and keytab [puppet] - 10https://gerrit.wikimedia.org/r/556744 (https://phabricator.wikimedia.org/T236180) [17:21:23] (03PS3) 10Ssingh: Add script for fetching routing information from RIPEstat [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/556732 [17:23:26] (03PS2) 10Elukey: role::search::airflow: add kerberos config and keytab [puppet] - 10https://gerrit.wikimedia.org/r/556744 (https://phabricator.wikimedia.org/T236180) [17:24:51] (03PS3) 10Elukey: role::search::airflow: add kerberos config and keytab [puppet] - 10https://gerrit.wikimedia.org/r/556744 (https://phabricator.wikimedia.org/T236180) [17:31:07] (03PS1) 10Elukey: Add fake keytab for an-airflow1001 [labs/private] - 10https://gerrit.wikimedia.org/r/556746 [17:31:24] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake keytab for an-airflow1001 [labs/private] - 10https://gerrit.wikimedia.org/r/556746 (owner: 10Elukey) [17:35:44] (03PS4) 10Gergő Tisza: mediawiki: maintenance script for purging old GrowthExperiments data [puppet] - 10https://gerrit.wikimedia.org/r/546896 (https://phabricator.wikimedia.org/T208369) [17:36:00] 10Operations, 10Traffic, 10serviceops: Use Envoy instead of nginx for TLS termination on Appservers - https://phabricator.wikimedia.org/T240576 (10ema) [17:38:23] (03PS5) 10Gergő Tisza: mediawiki: maintenance script for purging old GrowthExperiments data [puppet] - 10https://gerrit.wikimedia.org/r/546896 (https://phabricator.wikimedia.org/T208369) [17:39:32] 
10Operations, 10Traffic, 10serviceops: Use Envoy instead of nginx for TLS termination on Appservers - https://phabricator.wikimedia.org/T240576 (10ema) This is a severe case of PEBKAC: `curl` uses HTTP/2 by default, that's why the response has no TE:chunked. Forcing curl to use HTTP/1.1 we can see that inde... [17:39:37] 10Operations, 10Traffic, 10serviceops: Use Envoy instead of nginx for TLS termination on Appservers - https://phabricator.wikimedia.org/T240576 (10ema) p:05Triage→03Normal [17:41:29] _joe_: fixed, I think [17:55:04] (03PS5) 10Elukey: dumps::web::fetches::stats: move to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/556681 (https://phabricator.wikimedia.org/T234229) [18:00:04] cscott, arlolra, subbu, halfak, and accraze: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191212T1800). [18:02:24] 10Operations, 10netops: Add cloudmetrics1002 to network devices ACL - https://phabricator.wikimedia.org/T240456 (10jcrespo) Hi, @Phamhi How urgent is this? Our netop is on vacations and will return next week. If it cannot wait, I can try to find someone to help you. [18:07:15] 10Operations, 10netops: Add cloudmetrics1002 to network devices ACL - https://phabricator.wikimedia.org/T240456 (10Phamhi) Hi @jcrespo , if he or she comes back early next week then it should be fine. 
[18:09:29] 10Operations, 10netops: Add cloudmetrics1002 to network devices ACL - https://phabricator.wikimedia.org/T240456 (10jcrespo) a:03ayounsi [18:09:38] 10Operations, 10netops: Add cloudmetrics1002 to network devices ACL - https://phabricator.wikimedia.org/T240456 (10jcrespo) p:05Triage→03High [18:17:39] 10Operations, 10Puppet, 10Packaging, 10User-jbond: Create a resources for installing components - https://phabricator.wikimedia.org/T240324 (10jcrespo) I am trying to triage, this seems like an internal "nice to have" feature, like T178575 but not something that probably requires immediate action. Maybe it... [18:18:17] 10Operations, 10DNS, 10Research, 10Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (10jcrespo) a:05leila→03BBlack [18:21:37] 10Operations, 10Puppet, 10Packaging, 10User-jbond: Create a resources for installing components - https://phabricator.wikimedia.org/T240324 (10jbond) p:05Normal→03Low [18:22:02] 10Operations, 10Puppet, 10Packaging, 10User-jbond: Create a resources for installing components - https://phabricator.wikimedia.org/T240324 (10jbond) >>! In T240324#5737046, @jcrespo wrote: > I am trying to triage, this seems like an internal "nice to have" feature, like T178575 but not something that prob... [18:22:04] 10Operations, 10Continuous-Integration-Infrastructure, 10Gerrit, 10Release-Engineering-Team (CI & Testing services), and 2 others: Upload zuul_2.5.1-wmf11 to apt.wikimedia.org - https://phabricator.wikimedia.org/T240570 (10jcrespo) @hashar Just to be sure I understand the ticket (I am trying to process SRE... 
[18:23:19] !log arlolra@deploy1001 Started deploy [parsoid/deploy@5ba7506]: (no justification provided) [18:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:41] (03CR) 10Elukey: [C: 03+2] role::search::airflow: add kerberos config and keytab [puppet] - 10https://gerrit.wikimedia.org/r/556744 (https://phabricator.wikimedia.org/T236180) (owner: 10Elukey) [18:25:06] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@5ba7506]: (no justification provided) (duration: 01m 47s) [18:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:01] 10Operations, 10serviceops: High APCu fragmentation can impact server performance - https://phabricator.wikimedia.org/T240205 (10jcrespo) p:05Triage→03High [18:28:55] 10Operations, 10SRE-tools: wmf-auto-reimage errors: failure to downtime (w/ no rename), pytho gc whine - https://phabricator.wikimedia.org/T239897 (10jcrespo) p:05Normal→03Low Low (for now) based on Riccardo's comments. [18:30:19] 10Operations, 10netops: Facebook BGP peering links down in ulsfo - https://phabricator.wikimedia.org/T239896 (10jcrespo) Your reasoning seems ok to me, but we should CC @ayounsi of changes. [18:33:49] 10Operations, 10DC-Ops, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-jessie): Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (10jcrespo) This ticket needs more @RobH. :-) [18:34:36] (03CR) 10Elukey: [C: 03+2] dumps::web::fetches::stats: move to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/556681 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [18:36:35] 10Operations, 10observability: Make grafana-next.wm.o HTTP 302 redirect to grafana.wm.o - https://phabricator.wikimedia.org/T240048 (10jcrespo) CC @fgiunchedi , although maybe it was someone else from Foundations that worked on this? 
[18:37:09] 10Operations, 10observability: Make grafana-next.wm.o HTTP 302 redirect to grafana.wm.o - https://phabricator.wikimedia.org/T240048 (10jcrespo) p:05Triage→03Normal [18:40:55] 10Operations, 10serviceops: High APCu fragmentation can impact server performance - https://phabricator.wikimedia.org/T240205 (10jcrespo) Question, is this fully implemented and only missing validation that nothing breaks (in which case I will reduce the priority), or is it still WIP? (just checking phabricato... [18:41:42] (03PS1) 10Elukey: dumps::web::fetches::analytics::job: add absolute path to bash [puppet] - 10https://gerrit.wikimedia.org/r/556757 (https://phabricator.wikimedia.org/T234229) [18:42:06] (03CR) 10Elukey: [C: 03+2] dumps::web::fetches::analytics::job: add absolute path to bash [puppet] - 10https://gerrit.wikimedia.org/r/556757 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [18:42:28] oh, woops :-) [18:42:46] yeah I always forget :( [18:45:38] apergos: check systemctl list-timers, looks nice now :) [18:45:54] I am manually running one rsync/service unit [18:47:01] ah there they are :-) [18:47:34] makes me want to reach out and pat their little fuzzy heads (if they had any) [18:47:37] so cute! [18:48:16] if you want to check the same command on an-coord1001 you can see how it looks like for analytics [18:52:59] cdanis: looking at noc@ email, how does it feel being on the other side? :-D [18:53:11] 😤 [18:53:18] oh I will check, great idea! [18:53:43] cdanis: I loved that first line, followed by clearly a template [18:53:46] :-D [18:55:22] oh my, that's a full set!! [18:55:25] I am pretty sure that cdanis used some secret word to reach a human [18:55:32] that we don't know yet [18:56:04] apergos: those were all crons an year ago, then we moved to timers [18:56:10] and never looked back :D [18:56:15] no I waited four days for the wrong team to look at it elukey [18:56:17] I should thnk about moving some of the dumps crons over [18:56:40] (03PS3) 10C. 
Scott Ananian: Make Parsoid/PHP cluster read-write to record lints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556001 (https://phabricator.wikimedia.org/T237326) (owner: 10Arlolra) [18:56:52] apergos: long term it would be awesome to use apache airflow for some of these thigns [18:57:21] cdanis: I didn't say the right human for the job :D [18:58:18] my ones are all once a day or once a week so it's actually fine to just have the timers [18:58:47] also mine are all independent [18:59:14] the airflow stuff would be for the main mess that has a coordinating bash script for it now, hiding all the awfulness [18:59:26] (03CR) 10C. Scott Ananian: "I'd feel a little better about this if we were just turning on write access to the DB in a little window around the linter hook, to be hon" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556001 (https://phabricator.wikimedia.org/T237326) (owner: 10Arlolra) [19:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Morning SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191212T1900). [19:00:04] matthiasmullie: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:12] o/ [19:01:39] matthiasmullie: I can SWAT today! [19:01:58] cscott: arlolra: I see you have "patch to come" in current SWAT slot, should I count with you? [19:02:54] matthiasmullie: or you want to deploy that yourself? 
[19:03:04] (feel free to) [19:03:11] (03PS1) 10Bstorm: toolforge-k8s: harden the init config a bit [puppet] - 10https://gerrit.wikimedia.org/r/556767 (https://phabricator.wikimedia.org/T240009) [19:03:24] erm [19:03:33] yeah I'll do it myself, no need to waste anyone's time [19:04:19] ack [19:05:07] (03Abandoned) 10Bstorm: toolforge-kubernetes: disable profiling on api servers [puppet] - 10https://gerrit.wikimedia.org/r/555634 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [19:09:07] (03CR) 10Bstorm: "Deploying in place in toolsbeta before merge *just* in case :)" [puppet] - 10https://gerrit.wikimedia.org/r/556404 (https://phabricator.wikimedia.org/T239405) (owner: 10Bstorm) [19:13:12] Urbanecm: yeah, it's https://gerrit.wikimedia.org/r/556001 [19:13:29] we were just checking in w/ each other to make sure we're ready to go w/ it [19:15:31] I see, thanks. matthiasmullie is currently doing their stuff (soon-to-be-merged backport), so let's do that after [19:16:02] ok. arlolra is online too to monitor [19:16:21] ack [19:20:14] 10Operations, 10serviceops: High APCu fragmentation can impact server performance - https://phabricator.wikimedia.org/T240205 (10jijiki) 05Open→03Resolved I think we can mark this as resolved, our solution seems to be working. If something breaks, we will open a new task or reopen this one. Thank you! [19:24:22] (03CR) 10Urbanecm: [C: 03+1] Change wikimaniawiki logo back to general version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556740 (https://phabricator.wikimedia.org/T240578) (owner: 10Jforrester) [19:33:49] matthiasmullie: Patch seems to just merged. [19:34:29] !log mlitn@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/WikibaseMediaInfo/WikibaseMediaInfo.entitytypes.php: Register mediainfo-specific EntityIdLookup (duration: 01m 04s) [19:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:41] Urbanecm: done! (finally :o) [19:34:53] thanks! 
[19:34:59] cscott: want to deploy yourself, or should I? [19:35:47] cc arlolra [19:35:50] best if you did it, i haven't deployed a config change in a long time. at least a year. [19:36:17] okay, no problem [19:36:39] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 9.725 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:36:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:36:42] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556001 (https://phabricator.wikimedia.org/T237326) (owner: 10Arlolra) [19:37:36] (03Merged) 10jenkins-bot: Make Parsoid/PHP cluster read-write to record lints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556001 (https://phabricator.wikimedia.org/T237326) (owner: 10Arlolra) [19:38:00] cscott: arlolra: Could you test the patch at mwdebug1001, please? 
[19:38:10] on it [19:40:39] (03PS1) 10Bstorm: toolforge-k8s: harden the kubelet a bit [puppet] - 10https://gerrit.wikimedia.org/r/556782 (https://phabricator.wikimedia.org/T240009) [19:42:41] Urbanecm: seems to work [19:42:51] thanks, syncing [19:43:02] Urbanecm: config change mostly affects the parsoid cluster, but I tested that I haven't totally broken the Lint extension on the main cluster [19:43:20] ack [19:44:04] (03PS2) 10Bstorm: toolforge-k8s: harden the kubelet a bit [puppet] - 10https://gerrit.wikimedia.org/r/556782 (https://phabricator.wikimedia.org/T240009) [19:44:49] matthiasmullie: it appears that your change is raising some errors [19:44:51] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: ffe365e: Make Parsoid/PHP cluster read-write to record lints (T237326, T240057) (duration: 01m 02s) [19:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:57] T237326: Make Parsoid/PHP cluster read-write to ensure lints discovered by Parsoid/PHP are stored in the DB - https://phabricator.wikimedia.org/T237326 [19:45:07] cscott: done [19:45:09] matthiasmullie: https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor?_g=h@44136fa&_a=h@222f6b1 [19:45:47] Urbanecm: ok, testing [19:46:07] effie: at my side, the link says "Unable to completely restore the URL, be sure to use the share functionality" [19:46:22] (03CR) 10Bstorm: "This does the thing: https://puppet-compiler.wmflabs.org/compiler1003/19958/tools-k8s-worker-1.tools.eqiad.wmflabs/" [puppet] - 10https://gerrit.wikimedia.org/r/556782 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [19:46:33] +1, but I do get to see the errors [19:47:06] Urbanecm: oh wait [19:47:26] Urbanecm: https://logstash.wikimedia.org/goto/c1f8e707228a40973f5d2005e6fba665 does this work ? [19:47:46] meh, not sure what's going on there - I'll revert and investigate tomorrow [19:47:51] yup [19:47:59] thank you! 
[19:50:10] Urbanecm: let me know when you're done, then I'll revert [19:50:18] matthiasmullie: sorry, go for it! [19:55:45] Urbanecm: yep, all looks good for us. thanks! [19:55:51] yw! [19:55:52] !log mlitn@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/WikibaseMediaInfo/WikibaseMediaInfo.entitytypes.php: Revert: Register mediainfo-specific EntityIdLookup (duration: 01m 01s) [19:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:51] Urbanecm: revert done [19:56:57] ack [19:57:04] given nothing else is in our calendar [19:57:08] !log Morning SWAT done [19:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:45] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:00:04] marxarelli and James_F: That opportune time is upon us again. Time for a Mediawiki train - American Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191212T2000). [20:02:19] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.04167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:02:55] !log pool maps1001 - postgres re-init is complete - T239728 [20:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:01] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. 
- https://phabricator.wikimedia.org/T239728 [20:08:40] (03CR) 10Phamhi: [C: 03+1] toolforge-k8s: harden the kubelet a bit [puppet] - 10https://gerrit.wikimedia.org/r/556782 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [20:16:15] (03PS1) 10Dduvall: all wikis to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556794 [20:16:17] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556794 (owner: 10Dduvall) [20:17:02] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556794 (owner: 10Dduvall) [20:17:09] !log T229686 adding instances backing es1/es2/es3/x1 to dbctl's instance data [20:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:16] T229686: #dbctl: manage 'externalLoads' data - https://phabricator.wikimedia.org/T229686 [20:18:50] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.10 [20:18:50] !log T229686 adding sections es1/es2/es3/x1 to dbctl's section data [20:18:51] (03CR) 10BBlack: [C: 03+2] sec-warning: handle non-GET better [puppet] - 10https://gerrit.wikimedia.org/r/556674 (https://phabricator.wikimedia.org/T238038) (owner: 10BBlack) [20:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:24] !log cdanis@cumin2001 dbctl commit (dc=all): 'T229686 add sections es1/es2/es3/x1 and their instances', diff saved to https://phabricator.wikimedia.org/P9866 and previous config saved to /var/cache/conftool/dbconfig/20191212-202023-cdanis.json [20:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:11] (03CR) 10BBlack: [C: 03+2] dotls: define acme cert [puppet] - 10https://gerrit.wikimedia.org/r/556738 (https://phabricator.wikimedia.org/T239994) (owner: 10BBlack) [20:25:21] PROBLEM - 
Disk space on netflow2001 is CRITICAL: DISK CRITICAL - free space: / 302 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops [20:26:57] (03PS1) 10CDanis: conftool-data: add es[123]|x1 sections and instances [puppet] - 10https://gerrit.wikimedia.org/r/556800 (https://phabricator.wikimedia.org/T229686) [20:27:54] (03CR) 10CDanis: [C: 03+2] conftool-data: add es[123]|x1 sections and instances [puppet] - 10https://gerrit.wikimedia.org/r/556800 (https://phabricator.wikimedia.org/T229686) (owner: 10CDanis) [20:29:13] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [20:29:18] known, being fixed [20:29:19] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [20:31:13] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [20:31:19] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [20:39:45] I'm taking prod config for a bit. 
[20:39:52] (03PS3) 10Jforrester: Change wikimaniawiki logo back to general version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556740 (https://phabricator.wikimedia.org/T240578) [20:39:57] (03CR) 10Jforrester: [C: 03+2] Change wikimaniawiki logo back to general version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556740 (https://phabricator.wikimedia.org/T240578) (owner: 10Jforrester) [20:40:17] (03PS2) 10Jforrester: Enable the Wikisource extension on all Wikisources except old Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556639 (https://phabricator.wikimedia.org/T240546) (owner: 10Tpt) [20:40:50] (03Merged) 10jenkins-bot: Change wikimaniawiki logo back to general version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556740 (https://phabricator.wikimedia.org/T240578) (owner: 10Jforrester) [20:41:15] (03PS3) 10Jforrester: Enable the Wikisource extension on all Wikisources except old Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556639 (https://phabricator.wikimedia.org/T240546) (owner: 10Tpt) [20:41:20] (03CR) 10Jforrester: [C: 03+2] Enable the Wikisource extension on all Wikisources except old Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556639 (https://phabricator.wikimedia.org/T240546) (owner: 10Tpt) [20:42:17] (03Merged) 10jenkins-bot: Enable the Wikisource extension on all Wikisources except old Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556639 (https://phabricator.wikimedia.org/T240546) (owner: 10Tpt) [20:43:01] !log volker-e@deploy1001 Started deploy [design/style-guide@311d22e]: Deploy design/style-guide: [20:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:07] 10Operations, 10Traffic: Fix acme-chief DNS validation correctly - https://phabricator.wikimedia.org/T240614 (10BBlack) p:05Triage→03High [20:43:08] !log volker-e@deploy1001 Finished deploy [design/style-guide@311d22e]: Deploy design/style-guide: 
(duration: 00m 07s) [20:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:57] !log jforrester@deploy1001 Synchronized static/images/project-logos/wikimaniawiki-2x.png: T240578 Change wikimaniawiki logo back to general version, 2x (duration: 00m 56s) [20:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:03] T240578: Change Wikimania wiki's logo back to the standard Wikimania logo - https://phabricator.wikimedia.org/T240578 [20:45:12] (03PS1) 10BBlack: Temp fixup for acme_chief challenge validation [puppet] - 10https://gerrit.wikimedia.org/r/556806 (https://phabricator.wikimedia.org/T240614) [20:45:17] !log jforrester@deploy1001 Synchronized static/images/project-logos/wikimaniawiki-1.5x.png: T240578 Change wikimaniawiki logo back to general version, 1.5x (duration: 00m 55s) [20:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:10] 10Operations, 10ops-eqiad: setup/install censorship1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10Jclark-ctr) Labeled Host swap out the dual 480GB SSDs out for the dual 2TB SATA disks [20:46:23] !log jforrester@deploy1001 Synchronized static/images/project-logos/wikimaniawiki.png: T240578 Change wikimaniawiki logo back to general version, 1x (duration: 00m 56s) [20:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:30] 10Operations, 10ops-eqiad: setup/install censorship1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10Jclark-ctr) [20:47:00] (03CR) 10BBlack: [C: 03+2] Temp fixup for acme_chief challenge validation [puppet] - 10https://gerrit.wikimedia.org/r/556806 (https://phabricator.wikimedia.org/T240614) (owner: 10BBlack) [20:47:23] 10Operations, 10ops-eqiad: setup/install censorship1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10Jclark-ctr) a:05Jclark-ctr→03RobH [20:48:20] jclark-ctr: thanks! 
[20:48:51] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T240546 Enable the Wikisource extension on all Wikisources except old Wikisource (duration: 00m 57s) [20:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:57] T240546: Deploy the Wikisource extension on all Wikisources - https://phabricator.wikimedia.org/T240546 [20:49:48] (03PS3) 10Jforrester: Stop setting wgSpamBlacklistEventLogging, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554978 [20:49:54] (03CR) 10Jforrester: [C: 03+2] Stop setting wgSpamBlacklistEventLogging, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554978 (owner: 10Jforrester) [20:50:02] (03PS3) 10Jforrester: Drop wgMediaInfoEnableOtherStatements and wgDepictsQualifierProperties, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554979 [20:50:08] (03CR) 10Jforrester: [C: 03+2] Drop wgMediaInfoEnableOtherStatements and wgDepictsQualifierProperties, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554979 (owner: 10Jforrester) [20:50:16] (03PS4) 10Jforrester: Drop wgDisableRollbackConfirmationFeature, unread [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554980 [20:50:21] (03CR) 10Jforrester: [C: 03+2] Drop wgDisableRollbackConfirmationFeature, unread [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554980 (owner: 10Jforrester) [20:50:58] (03Merged) 10jenkins-bot: Stop setting wgSpamBlacklistEventLogging, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554978 (owner: 10Jforrester) [20:51:14] (03Merged) 10jenkins-bot: Drop wgMediaInfoEnableOtherStatements and wgDepictsQualifierProperties, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554979 (owner: 10Jforrester) [20:51:38] (03Merged) 10jenkins-bot: Drop wgDisableRollbackConfirmationFeature, unread [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554980 (owner: 10Jforrester) [20:52:39] !log 
jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Drop wgSpamBlacklistEventLogging, no longer read (duration: 00m 58s) [20:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:54] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Drop wgMediaInfoEnableOtherStatements, wgDepictsQualifierProperties, and wgDisableRollbackConfirmationFeature (duration: 00m 58s) [20:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:04] (03PS1) 10RobH: censorship1001.eqiad.wmnet dns updates [dns] - 10https://gerrit.wikimedia.org/r/556808 (https://phabricator.wikimedia.org/T239250) [20:56:43] (03CR) 10RobH: [C: 03+2] censorship1001.eqiad.wmnet dns updates [dns] - 10https://gerrit.wikimedia.org/r/556808 (https://phabricator.wikimedia.org/T239250) (owner: 10RobH) [20:57:49] (03PS2) 10BBlack: dotls: main implementation [puppet] - 10https://gerrit.wikimedia.org/r/556739 (https://phabricator.wikimedia.org/T239994) [20:57:51] (03PS1) 10BBlack: dotls: test on dns4002 [puppet] - 10https://gerrit.wikimedia.org/r/556809 (https://phabricator.wikimedia.org/T239994) [20:58:16] 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10RobH) 05Open→03Resolved Setup is being done via T239250, resolving task. [20:58:27] (03PS3) 10BBlack: dotls: main implementation [puppet] - 10https://gerrit.wikimedia.org/r/556739 (https://phabricator.wikimedia.org/T239994) [20:58:29] (03PS2) 10BBlack: dotls: test on dns4002 [puppet] - 10https://gerrit.wikimedia.org/r/556809 (https://phabricator.wikimedia.org/T239994) [21:01:29] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install censorship1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10RobH) [21:02:32] James_F: are you running the train? 
[21:02:58] we'd like to make an end-of-the-day bug-fix deploy of parsoid, wanted to make sure we weren't going to step on ops toes [21:05:27] (03CR) 10BBlack: [C: 03+2] dotls: main implementation [puppet] - 10https://gerrit.wikimedia.org/r/556739 (https://phabricator.wikimedia.org/T239994) (owner: 10BBlack) [21:05:32] (03CR) 10BBlack: [C: 03+2] dotls: test on dns4002 [puppet] - 10https://gerrit.wikimedia.org/r/556809 (https://phabricator.wikimedia.org/T239994) (owner: 10BBlack) [21:06:47] marxarelli, James_F is the american train done? [21:07:02] cscott: done! [21:07:16] nothing's on fire, no reason we can't do a quick parsoid deploy? [21:07:54] i won't stop you :) [21:08:01] great, thanks [21:09:17] (03PS3) 10Bstorm: toolforge-k8s: harden the kubelet a bit [puppet] - 10https://gerrit.wikimedia.org/r/556782 (https://phabricator.wikimedia.org/T240009) [21:12:06] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: harden the kubelet a bit [puppet] - 10https://gerrit.wikimedia.org/r/556782 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [21:14:53] (03PS1) 10BBlack: dotls: fix listen specs [puppet] - 10https://gerrit.wikimedia.org/r/556814 (https://phabricator.wikimedia.org/T239994) [21:16:50] (03CR) 10BBlack: [C: 03+2] dotls: fix listen specs [puppet] - 10https://gerrit.wikimedia.org/r/556814 (https://phabricator.wikimedia.org/T239994) (owner: 10BBlack) [21:23:55] !log arlolra@deploy1001 Started deploy [parsoid/deploy@75d72e8]: Updating Parsoid to 28d7c21 [21:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:29] (03CR) 10Bstorm: "Funny thing about this one is it will only affect a new server much. 
The changes to the manifests need to be done one at a time to keep th" [puppet] - 10https://gerrit.wikimedia.org/r/556767 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [21:25:03] (03PS2) 10Bstorm: nginx-ingress: Have ingress pods request realistic resting resources [puppet] - 10https://gerrit.wikimedia.org/r/556404 (https://phabricator.wikimedia.org/T239405) [21:26:18] (03CR) 10Bstorm: [C: 03+2] nginx-ingress: Have ingress pods request realistic resting resources [puppet] - 10https://gerrit.wikimedia.org/r/556404 (https://phabricator.wikimedia.org/T239405) (owner: 10Bstorm) [21:27:59] 10Operations, 10Traffic, 10Patch-For-Review: Implement DNS-over-TLS for AuthDNS - https://phabricator.wikimedia.org/T239994 (10BBlack) P9867 <- First internal test query on a prod dns box :) [21:31:37] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@75d72e8]: Updating Parsoid to 28d7c21 (duration: 07m 41s) [21:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:06] (03PS1) 10Bstorm: toolforge-k8s: fix a missing space in the config [puppet] - 10https://gerrit.wikimedia.org/r/556820 [21:37:07] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: fix a missing space in the config [puppet] - 10https://gerrit.wikimedia.org/r/556820 (owner: 10Bstorm) [21:38:40] (03PS1) 10BBlack: dotls: simpler and clearer listen config [puppet] - 10https://gerrit.wikimedia.org/r/556821 (https://phabricator.wikimedia.org/T239994) [21:40:54] (03CR) 10BBlack: [C: 03+2] dotls: simpler and clearer listen config [puppet] - 10https://gerrit.wikimedia.org/r/556821 (https://phabricator.wikimedia.org/T239994) (owner: 10BBlack) [21:57:01] (03PS1) 10RobH: censorship1001.eqiad.wmnet setup [puppet] - 10https://gerrit.wikimedia.org/r/556823 (https://phabricator.wikimedia.org/T239250) [22:05:22] (03PS1) 10BBlack: dotls: add ferm and NRPE monitoring via kdig [puppet] - 10https://gerrit.wikimedia.org/r/556827 (https://phabricator.wikimedia.org/T239994) [22:11:02] (03CR) 
10BBlack: [C: 03+2] dotls: add ferm and NRPE monitoring via kdig [puppet] - 10https://gerrit.wikimedia.org/r/556827 (https://phabricator.wikimedia.org/T239994) (owner: 10BBlack) [22:35:55] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Jdforrester-WMF) @hashar, rather than blocking the migration to bust... [22:38:09] (03Abandoned) 10Jdlrobson: Enable MFMobileMainPageCss on Hindi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421184 (https://phabricator.wikimedia.org/T190101) (owner: 10Jdlrobson) [22:40:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Jclark-ctr) [22:43:21] (03PS1) 10BBlack: dotls: haproxy gdnsd dep and smooth reloads [puppet] - 10https://gerrit.wikimedia.org/r/556831 (https://phabricator.wikimedia.org/T239994) [22:48:52] (03CR) 10BBlack: [C: 03+2] dotls: haproxy gdnsd dep and smooth reloads [puppet] - 10https://gerrit.wikimedia.org/r/556831 (https://phabricator.wikimedia.org/T239994) (owner: 10BBlack) [22:58:44] (03PS1) 10BBlack: dotls: glue haproxy to gdnsd in systemd [puppet] - 10https://gerrit.wikimedia.org/r/556833 (https://phabricator.wikimedia.org/T239994) [23:03:04] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [23:03:34] (03CR) 10BBlack: [C: 03+2] dotls: glue haproxy to gdnsd in systemd [puppet] - 10https://gerrit.wikimedia.org/r/556833 (https://phabricator.wikimedia.org/T239994) (owner: 10BBlack) [23:03:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1066.eqiad.wmnet - https://phabricator.wikimedia.org/T233071 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [23:04:40] 10Operations, 10ops-eqiad, 10DC-Ops, 
10decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [23:10:42] 10Operations, 10Traffic: Implement DNS-over-TLS for AuthDNS - https://phabricator.wikimedia.org/T239994 (10BBlack) This is now mostly-working, with hiera flag controlling test deployment (currently only on dns4002, which doesn't have any public authserver IPs routed into it at this time). Reminders on the nex... [23:13:19] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [23:15:06] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [23:16:03] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [23:17:13] 10Operations, 10ops-eqiad, 10Analytics, 10decommission: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [23:32:15] 10Operations, 10ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (10RobH) [23:44:46] (03CR) 10Jhedden: [C: 03+1] "Looks good, question on the garbage collection but not blockers" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556767 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [23:46:31] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:51:20] (03CR) 10Jhedden: [C: 03+1] nova: add nova-api middleware to inject a default user_data file [puppet] - 10https://gerrit.wikimedia.org/r/556135 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [23:53:18] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate 
producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite)