[00:00:04] twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200702T0000). [00:16:25] PROBLEM - snapshot of s4 in eqiad on db2093 is CRITICAL: snapshot for s4 at eqiad taken more than 3 days ago: Most recent backup 2020-06-28 23:47:14 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:18:01] PROBLEM - snapshot of s4 in codfw on db2093 is CRITICAL: snapshot for s4 at codfw taken more than 3 days ago: Most recent backup 2020-06-28 23:57:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:48:11] PROBLEM - MariaDB Replica Lag: m1 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1075.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [00:48:25] PROBLEM - MariaDB Replica Lag: m1 on db1117 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1089.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [00:49:03] PROBLEM - MariaDB Replica Lag: m1 on db2132 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1126.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [00:49:09] PROBLEM - MariaDB Replica Lag: m1 on db1080 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1133.72 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:03:47] RECOVERY - MariaDB Replica Lag: m1 on db2132 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:04:59] RECOVERY - MariaDB Replica Lag: m1 on db1117 is OK: OK slave_sql_lag Replication lag: 0.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:09:29] PROBLEM - snapshot of s2 in eqiad on db2093 is CRITICAL: snapshot for s2 at eqiad taken more than 3 days ago: Most recent backup 
2020-06-29 00:42:49 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [01:13:05] RECOVERY - MariaDB Replica Lag: m1 on db1080 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:19:01] 10Operations, 10Analytics-Radar, 10Wikimedia-Logstash, 10observability, 10Performance-Team (Radar): Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10Krinkle) I'm confused.. I thought we were already on the Kafka pipeline with udp2l... [01:24:35] PROBLEM - snapshot of s2 in codfw on db2093 is CRITICAL: snapshot for s2 at codfw taken more than 3 days ago: Most recent backup 2020-06-29 00:59:53 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [01:30:40] RECOVERY - MariaDB Replica Lag: m1 on db2078 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [02:56:03] PROBLEM - snapshot of s5 in eqiad on db2093 is CRITICAL: snapshot for s5 at eqiad taken more than 3 days ago: Most recent backup 2020-06-29 02:43:33 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [02:59:24] (03CR) 10Bmansurov: Add recommendation-api helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [03:03:59] PROBLEM - snapshot of s5 in codfw on db2093 is CRITICAL: snapshot for s5 at codfw taken more than 3 days ago: Most recent backup 2020-06-29 02:34:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [03:10:25] (03CR) 10Thcipriani: [C: 03+1] Fixed a minor typo [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/608929 (owner: 10Ahmon Dancy) [03:10:57] (03CR) 10Thcipriani: [C: 03+1] Fixed paths to SETUP.* files in README [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/608928 (owner: 10Ahmon Dancy) [03:52:11] PROBLEM 
- snapshot of s7 in eqiad on db2093 is CRITICAL: snapshot for s7 at eqiad taken more than 3 days ago: Most recent backup 2020-06-29 03:45:42 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [04:14:49] PROBLEM - snapshot of s7 in codfw on db2093 is CRITICAL: snapshot for s7 at codfw taken more than 3 days ago: Most recent backup 2020-06-29 03:57:55 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [04:18:50] PROBLEM - snapshot of s6 in codfw on db2093 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2020-06-29 04:11:46 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [04:26:27] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:27:25] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 91, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:28:17] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:31:05] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:32:29] PROBLEM - snapshot of s6 in eqiad on db2093 is CRITICAL: snapshot for s6 at eqiad taken more than 3 days ago: Most recent backup 2020-06-29 04:21:50 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [04:37:07] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:53:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 91, down: 1, dormant: 0, 
excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:58:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:09:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 91, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:18:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:33:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 91, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:35:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:43:49] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:45:37] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:46:33] <_joe_> !log upload docker-report 0.0.4 on buster-wikimedia T242604 [05:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:38] T242604: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 [05:50:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 91, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:51:09] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:52:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:52:46] <_joe_> !log removing all tags for envoy-tls-local-proxy [05:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:00] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:54:51] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 2 others: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (10Joe) Docker-report can now support such cases, and I removed the tags for that repository. 
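The db2093 snapshot alerts in this log each report a most recent backup just over three days before the alert time. A minimal sketch of that freshness rule, assuming the check simply compares the newest snapshot timestamp against a fixed 3-day window (the function name and the comparison are illustrative assumptions, not the actual Icinga check):

```python
from datetime import datetime, timedelta

# Assumption: a fixed 3-day window, taken from the alert text
# "taken more than 3 days ago"; the real check may be configured differently.
MAX_AGE = timedelta(days=3)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def snapshot_status(latest_backup: str, now: str, max_age: timedelta = MAX_AGE) -> str:
    """Return "CRITICAL" when the newest snapshot is older than max_age, else "OK"."""
    age = datetime.strptime(now, TIME_FORMAT) - datetime.strptime(latest_backup, TIME_FORMAT)
    return "CRITICAL" if age > max_age else "OK"


# s4 in eqiad: alert logged at 00:16:25 on 2020-07-02,
# most recent backup 2020-06-28 23:47:14 (age: 3 days, 0:29:11).
print(snapshot_status("2020-06-28 23:47:14", "2020-07-02 00:16:25"))
```

Under this assumption the alert turns CRITICAL shortly after the newest snapshot crosses the 3-day mark, which matches the timing of the s4, s2 and s5 alerts above.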
[05:59:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 91, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:00:23] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:02:13] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:06:52] 10Operations, 10ops-eqiad: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (10ops-monitoring-bot) [06:07:46] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (10Majavah) [06:08:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:09:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:12:44] (03PS5) 10ZPapierski: Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) [06:14:11] (03CR) 10jerkins-bot: [V: 04-1] Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [06:14:47] (03PS2) 10ZPapierski: Authenticate with MW oauth 1.0a for WCQS [puppet] - 10https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498) [06:15:43] (03Abandoned) 10ZPapierski: Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608824 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [06:25:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 91, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:31:43] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:33:33] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:35:05] PROBLEM - snapshot of x1 in eqiad on db2093 is CRITICAL: snapshot for x1 at eqiad taken more than 3 days ago: Most recent backup 2020-06-29 06:08:07 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [06:36:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:41:48] (03PS1) 10KartikMistry: Update cxserver to 2020-07-01-044435-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/609058 (https://phabricator.wikimedia.org/T254143) [06:46:30] (03PS2) 10KartikMistry: Update cxserver to 2020-07-01-044435-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/609058 (https://phabricator.wikimedia.org/T254143) [06:51:51] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:53:01] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 91, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:53:41] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:54:09] PROBLEM - snapshot of s3 in codfw on db2093 is CRITICAL: snapshot for s3 at codfw taken more than 3 days ago: Most recent backup 2020-06-29 06:31:58 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200702T0700) [07:05:55] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:09:07] (03PS13) 10Privacybatm: transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [07:17:35] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:18:21] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] "Thanks!" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/608928 (owner: 10Ahmon Dancy) [07:19:07] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] "Thanks!" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/608929 (owner: 10Ahmon Dancy) [07:20:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:21:39] (03PS12) 10Elukey: Introduce profile::mariadb::misc::analytics [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) [07:22:11] (03PS6) 10ZPapierski: Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) [07:22:37] PROBLEM - snapshot of s3 in eqiad on db2093 is CRITICAL: snapshot for s3 at eqiad taken more than 3 days ago: Most recent backup 2020-06-29 06:49:36 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [07:23:07] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:23:28] (03CR) 10jerkins-bot: [V: 04-1] Handle oauth proxy settings [puppet] - 
10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [07:25:42] (03PS3) 10ZPapierski: Authenticate with MW oauth 1.0a for WCQS [puppet] - 10https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498) [07:31:12] (03CR) 10Elukey: [C: 03+2] Introduce profile::mariadb::misc::analytics [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [07:31:25] marostegui: --^ [07:31:29] (03PS7) 10ZPapierski: Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) [07:32:35] (03PS4) 10ZPapierski: Authenticate with MW oauth 1.0a for WCQS [puppet] - 10https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498) [07:32:42] (03CR) 10jerkins-bot: [V: 04-1] Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [07:33:57] (03PS8) 10ZPapierski: Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) [07:40:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:40:59] (03PS1) 10Muehlenhoff: Remove obsolete pinning for liblwloc [puppet] - 10https://gerrit.wikimedia.org/r/609100 (https://phabricator.wikimedia.org/T256877) [07:44:59] kormat: o/ [07:45:13] I merged the patch for db1108, IIRC you wanted a ping [07:45:20] (still setting up things etc..) [07:45:44] oh yeah. 
thanks :) (i had seen the mail, wondered why i was on the cc list, and archived it :) [07:45:48] (03PS1) 10Jcrespo: mariadb-backups: Fix remote_backup_mariadb.py deps and path [puppet] - 10https://gerrit.wikimedia.org/r/609101 (https://phabricator.wikimedia.org/T138562) [07:46:45] (03PS2) 10Jcrespo: mariadb-backups: Fix remote_backup_mariadb.py deps and path [puppet] - 10https://gerrit.wikimedia.org/r/609101 (https://phabricator.wikimedia.org/T138562) [07:48:47] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Fix remote_backup_mariadb.py deps and path [puppet] - 10https://gerrit.wikimedia.org/r/609101 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [07:50:05] kormat: spam from analytics people, yeah annoying, Manuel feels the same every time I ping him I am sure :D [07:50:11] :D [07:52:19] PROBLEM - snapshot of x1 in codfw on db2093 is CRITICAL: snapshot for x1 at codfw taken more than 3 days ago: Most recent backup 2020-06-29 07:23:51 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:03:01] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:39] (03PS1) 10Muehlenhoff: Unconditionally install systemd packages [puppet] - 10https://gerrit.wikimedia.org/r/609104 (https://phabricator.wikimedia.org/T256877) [08:05:43] (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: switch to new names for global availability metrics [puppet] - 10https://gerrit.wikimedia.org/r/608319 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [08:05:52] (03PS2) 10Filippo Giunchedi: monitoring: switch to new names for global availability metrics [puppet] - 10https://gerrit.wikimedia.org/r/608319 (https://phabricator.wikimedia.org/T233956) [08:10:21] 10Operations, 10LDAP-Access-Requests: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (10danshick-wmde) Thank you @Dzahn ! I am often flummoxed by usernames with spaces in 'em, but this worked. Much appreciated -- [08:10:42] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10hashar) I am wondering how to find packages that got installed from `stretch-backports`. Our base images have backports configured and one can borrow a package from it when it does not exist in s... [08:13:43] 10Operations, 10ops-eqsin, 10netops: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 (10ayounsi) 05Open→03Resolved a:03ayounsi Netbox updated. [08:13:51] (03CR) 10Hashar: [C: 03+1] "That is one of the edge case in which copy pasting is a good thing :] Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/608953 (https://phabricator.wikimedia.org/T256729) (owner: 10QChris) [08:14:42] 10Operations, 10ops-eqsin, 10DC-Ops, 10netops: (Need By: TBD) rack/setup/install cr3-eqsin.wikimedia.org - https://phabricator.wikimedia.org/T253246 (10ayounsi) 05Open→03Resolved I believe there is nothing left to do here, see T255766. 
[08:14:44] 10Operations, 10ops-eqsin, 10netops: cr3-eqsin to production - https://phabricator.wikimedia.org/T255766 (10ayounsi) [08:15:06] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (10Gilles) a:03jijiki [08:15:24] (03CR) 10Hashar: [C: 03+1] gerrit: remove all database parameters / support [puppet] - 10https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) (owner: 10Dzahn) [08:16:09] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (10Kormat) I'm at a loss here. There's nothing recent in the ilo log. This is the last entry: ` /system1/log1/record36 Targets Properties number=36 severity=Caution date=06/18/2020 time=... [08:16:55] (03CR) 10Filippo Giunchedi: [C: 03+1] purged: alert if local backlog grows past the given limits [puppet] - 10https://gerrit.wikimedia.org/r/608564 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [08:17:29] 10Operations, 10ops-eqsin: Decommission cr1-eqsin - https://phabricator.wikimedia.org/T256947 (10ayounsi) p:05Triage→03Low [08:17:32] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (10jcrespo) It says: Battery count: 0. Did the battery die or did it come back after it failed (or detected failed)? [08:18:48] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [08:20:11] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10MoritzMuehlenhoff) >>! In T256877#6273759, @hashar wrote: > I am wondering how to find packages that got installed from `stretch-backports`. Any package from backports should follow the "~bpo9+... 
[08:20:28] Operations, ops-eqiad, DBA: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (Kormat) The battery has been dead for a long time {T225391}, it's a [[ https://github.com/wikimedia/puppet/blob/4d037cb8debde241870cee57a7ecb39e2b718f25/hieradata/hosts/db1077.yaml#L2 | known issue ]].
[08:21:10] Operations, ops-eqiad, DBA: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (jcrespo) Apparently this is a known issue: T226519 Maybe then auto-icinga-ack should be disabled for this host?
[08:22:53] elukey: while sections have an arbitrary name, "meta" seems too generic (easily confused with the metawiki database). Maybe analytics-meta or something else?
[08:23:40] jynus: sure no problem!
[08:23:51] do you have a minute for a multi-instance qs?
[08:23:54] sure
[08:24:20] how are the mariadb@xyz.service units created?
[08:24:35] they are not physically created
[08:24:37] I checked on db1117 and I see them, but not for db1108
[08:24:49] there is a mariadb@ unit that can receive parameters
[08:25:09] the package takes care
[08:25:13] if they exist that must be wrong
[08:25:34] let me show you
[08:25:35] (CR) Filippo Giunchedi: [C: -1] "See inline, LGTM otherwise!" 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/608417 (owner: Jbond)
[08:25:54] super, so after running mysql_install_db I just need to systemctl start mariadb@matomo/analytics-meta with params
[08:26:13] correct
[08:26:23] plus you can add specific configuration
[08:26:34] (CR) Filippo Giunchedi: [C: -1] "-1 to block this on Iaa969d0fba to resolve vhost diffs" [puppet] - https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: Jbond)
[08:26:36] on its own my.section.cnf
[08:26:38] if needed
[08:26:49] we normally only change the memory assigned
[08:26:55] but you could add more things
[08:27:46] (CR) Filippo Giunchedi: [C: +1] scap::target: make this resource ensurable [puppet] - https://gerrit.wikimedia.org/r/608811 (owner: Jbond)
[08:28:19] when you said that you saw them on db1117, I am guessing you saw them on systemctl, not on the filesystem, right?
[08:28:24] (CR) Filippo Giunchedi: [C: +1] Make Graphite httpd site configurable [puppet] - https://gerrit.wikimedia.org/r/608881 (owner: Muehlenhoff)
[08:28:51] elukey: cat /lib/systemd/system/mariadb@.service
[08:28:55] (PS5) ZPapierski: Authenticate with MW oauth 1.0a for WCQS [puppet] - https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498)
[08:29:06] and that is the parameterized systemd unit that all mariadb@something will use
[08:29:12] and the parameter will be used as a template
[08:29:14] jynus: yes yes I found it but didn't know how to start specific instances, very nice
[08:29:16] (PS1) Elukey: Rename analytics meta mariadb instance for Backup host [puppet] - https://gerrit.wikimedia.org/r/609106 (https://phabricator.wikimedia.org/T234826)
[08:29:33] that is a systemd feature really, not mariadb specific
[08:29:45] but I think only us, prometheus a a couple of things uses it
[08:29:49] *and
[08:30:00] (CR) Filippo Giunchedi: [C: +2] logstash: add tests for filter-mediawiki [puppet] - 
https://gerrit.wikimedia.org/r/608061 (https://phabricator.wikimedia.org/T251869) (owner: Filippo Giunchedi)
[08:30:05] prometheus will have its own service parametrized
[08:30:17] so you will have prometheus-mysqld-exporter@matomo, etc
[08:30:18] yep perfect
[08:30:38] you can set up the systemd service on puppet if you want
[08:30:39] can you check my change above to see if the naming is ok?
[08:30:47] (CR) jerkins-bot: [V: -1] Rename analytics meta mariadb instance for Backup host [puppet] - https://gerrit.wikimedia.org/r/609106 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[08:30:51] ahaha nope
[08:30:54] lemme fix it first
[08:30:56] we don't start it manually for reasons, but doesn't mean it cannot be set up on non-core systems
[08:31:27] fine to start manually!
[08:31:59] (PS3) JMeybohm: Add patches for swift auth and bind interface [debs/chartmuseum] - https://gerrit.wikimedia.org/r/608088 (https://phabricator.wikimedia.org/T253843)
[08:32:11] we have a mariadb optional service module that takes care of that
[08:32:13] if needed
[08:32:35] are you on purpose leaving memory available for more instances?
[08:32:44] (CR) Filippo Giunchedi: [C: +2] thanos: set consistency-delay on store [puppet] - https://gerrit.wikimedia.org/r/607783 (https://phabricator.wikimedia.org/T233956) (owner: Filippo Giunchedi)
[08:32:58] 300GB will have around 100G unused out of 512GB server
[08:33:20] actually 200GB available, I misread
[08:33:23] is that ok?
[08:33:24] (PS2) Elukey: Rename analytics meta mariadb instance for Backup host [puppet] - https://gerrit.wikimedia.org/r/609106 (https://phabricator.wikimedia.org/T234826)
[08:33:29] jynus: yes that is the idea, I think that more may come in the future and the actual ones should be small.. 
but we can tune the memory usage if needed, I am basically trying out settings now
[08:33:46] that is cool
[08:33:53] we have some hosts there
[08:33:55] just FYI
[08:34:43] so sorry about meta, but for example, staging is generic but has no wiki counterpart
[08:34:56] but meta really is confusing with meta.wikimedia.org
[08:35:04] Operations, ops-eqsin: Decommission cr1-eqsin - https://phabricator.wikimedia.org/T256947 (ayounsi) This router will annoy us until the end. It is stuck in a reboot loop at the U-boot level. It looks like something is wrong with the USB controller. `lines=20 U-Boot 2011.12 (Aug 29 2013 - 15:55:06) Buil...
[08:35:25] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:35:26] looks ok, as this is a new host I would deploy and correct issues later
[08:35:50] do you know how to provision it?
[08:36:00] I can help with that with our provisioning/backup tools
[08:37:08] check however with kormat as he did some work on changing how puppet works with sections
[08:37:22] better to fix naming now, np thanks for the suggestions :)
[08:37:23] (CR) Hashar: "Seems it is a regression in Gerrit and {$formatChangeUrl} is no more escape." 
[puppet] - 10https://gerrit.wikimedia.org/r/608954 (https://phabricator.wikimedia.org/T256729) (owner: 10QChris) [08:38:04] (03CR) 10Jcrespo: [C: 03+1] "Looks ok, but check with kormat as he may have some tips in terms of puppetization of new sections" [puppet] - 10https://gerrit.wikimedia.org/r/609106 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [08:38:51] no idea about how to set up the backup/replica settings, today my goal is to create the mariadb instances :) [08:38:59] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove jessie base images building process [puppet] - 10https://gerrit.wikimedia.org/r/587529 (https://phabricator.wikimedia.org/T249724) (owner: 10Alexandros Kosiaris) [08:39:46] (03CR) 10JMeybohm: [C: 03+2] "Thanks Alex!" (032 comments) [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/608088 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [08:41:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] "> Note that all the node6 images used in CI are based on wikimedia-jessie; we're blocked on removing this by the migration of production s" [puppet] - 10https://gerrit.wikimedia.org/r/587529 (https://phabricator.wikimedia.org/T249724) (owner: 10Alexandros Kosiaris) [08:42:59] (03Merged) 10jenkins-bot: Add patches for swift auth and bind interface [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/608088 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [08:44:17] 10Operations, 10ops-eqiad, 10DBA, 10User-Kormat: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (10Kormat) a:03Marostegui [08:45:14] (03CR) 10Privacybatm: "Checking why is it happening!" 
(031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [08:46:01] (03CR) 10Elukey: [C: 03+2] Rename analytics meta mariadb instance for Backup host [puppet] - 10https://gerrit.wikimedia.org/r/609106 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [08:46:21] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 57 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:52:09] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:52:10] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10ema) >>! In T256444#6268449, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/Q38pBnMBv... 
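The earlier exchange about `/lib/systemd/system/mariadb@.service` describes systemd template units: a single unit file serves every `mariadb@<section>` instance, and the text after `@` is handed to the unit as its instance name. A rough sketch of that naming scheme follows, assuming only the `%i` specifier needs expanding. The `EXAMPLE_LINE` content and the `analytics_meta` section name are hypothetical placeholders, not copied from the real package or puppet change.

```python
def instance_unit(template: str, instance: str) -> str:
    """Turn a template unit name such as 'mariadb@.service' into an instance name."""
    prefix, sep, suffix = template.partition("@")
    if not sep or not suffix.startswith("."):
        raise ValueError(f"{template!r} is not a template unit name")
    return f"{prefix}@{instance}{suffix}"


def expand(line: str, instance: str) -> str:
    """Expand only the %i (instance name) specifier; systemd itself supports many more."""
    return line.replace("%i", instance)


# Hypothetical unit line for illustration, NOT taken from the real mariadb@.service.
EXAMPLE_LINE = "ExecStart=/usr/local/bin/example-daemon --config /etc/example/%i.cnf"

# "matomo" appears in the log; "analytics_meta" is an assumed section name.
for section in ("matomo", "analytics_meta"):
    print(instance_unit("mariadb@.service", section))
    print(instance_unit("prometheus-mysqld-exporter@.service", section))
    print(expand(EXAMPLE_LINE, section))
```

Because the instance units are derived from the template at start time, nothing per-instance has to exist under `/lib/systemd/system`, which is consistent with the remark that the units "are not physically created".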
[08:53:39] (03PS1) 10Ayounsi: Use flex-flow-sizing on MX204 [homer/public] - 10https://gerrit.wikimedia.org/r/609109 (https://phabricator.wikimedia.org/T248394) [08:54:33] (03CR) 10Ayounsi: [C: 03+2] Use flex-flow-sizing on MX204 [homer/public] - 10https://gerrit.wikimedia.org/r/609109 (https://phabricator.wikimedia.org/T248394) (owner: 10Ayounsi) [08:55:02] (03Merged) 10jenkins-bot: Use flex-flow-sizing on MX204 [homer/public] - 10https://gerrit.wikimedia.org/r/609109 (https://phabricator.wikimedia.org/T248394) (owner: 10Ayounsi) [08:56:30] 10Operations, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, 10Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (10ema) p:05Triage→03Medium [08:59:32] !log deploy flex flow for MX204s - T248394 [08:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:37] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 [09:05:15] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jbond) 05Open→03Resolved Thanks @jcrespo Thanks for helping this is all set up and ready for the ticket to close however this database will... 
[09:05:18] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: CAS Store U2f tokens in a database - https://phabricator.wikimedia.org/T256113 (10jbond) [09:07:25] !log addshore@mwmaint1002:~$ mwscript maintenance/createAndPromote.php --wiki testwikidatawiki --force --custom-groups oversight "Addshore" # T256949 [09:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:30] T256949: The streaming updater should support suppressed deletes - https://phabricator.wikimedia.org/T256949 [09:07:36] !log addshore@mwmaint1002:~$ mwscript maintenance/createAndPromote.php --wiki testwikidatawiki --force --custom-groups oversight "DCausse_(WMF)" # T256949 [09:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:56] (03PS5) 10Kormat: mariadb: Add role and section profiles to remaining mariadb roles. [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) [09:09:50] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (10Kormat) [09:09:55] (03PS6) 10Ayounsi: Add graceful-switchover to multiple RE devices [homer/public] - 10https://gerrit.wikimedia.org/r/592938 (https://phabricator.wikimedia.org/T191667) [09:12:32] 10Operations, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, 10Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (10ema) The cause is most probably T256444. I'm saying this based on two important pieces of i... [09:12:57] (03PS1) 10Elukey: profile::mariadb::misc::analytics::multiinstance: use underscore [puppet] - 10https://gerrit.wikimedia.org/r/609112 (https://phabricator.wikimedia.org/T234826) [09:13:10] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "The statement in the comment is still true. 
We need systemd >= 239 which is not available in the standard stretch repository." [puppet] - 10https://gerrit.wikimedia.org/r/609104 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [09:15:32] (03CR) 10Elukey: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/609104 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [09:15:40] (03CR) 10Ayounsi: [C: 03+2] Add graceful-switchover to multiple RE devices [homer/public] - 10https://gerrit.wikimedia.org/r/592938 (https://phabricator.wikimedia.org/T191667) (owner: 10Ayounsi) [09:16:03] (03CR) 10Elukey: [C: 03+2] profile::mariadb::misc::analytics::multiinstance: use underscore [puppet] - 10https://gerrit.wikimedia.org/r/609112 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [09:16:07] (03Merged) 10jenkins-bot: Add graceful-switchover to multiple RE devices [homer/public] - 10https://gerrit.wikimedia.org/r/592938 (https://phabricator.wikimedia.org/T191667) (owner: 10Ayounsi) [09:16:57] (03PS6) 10Ayounsi: add graceful-restart to CRs [homer/public] - 10https://gerrit.wikimedia.org/r/577564 (https://phabricator.wikimedia.org/T191667) (owner: 10CDanis) [09:17:27] 10Operations, 10Analytics, 10Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10ema) [09:17:30] (03CR) 10Ayounsi: [C: 03+2] add graceful-restart to CRs [homer/public] - 10https://gerrit.wikimedia.org/r/577564 (https://phabricator.wikimedia.org/T191667) (owner: 10CDanis) [09:17:53] (03Merged) 10jenkins-bot: add graceful-restart to CRs [homer/public] - 10https://gerrit.wikimedia.org/r/577564 (https://phabricator.wikimedia.org/T191667) (owner: 10CDanis) [09:19:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/608966 (https://phabricator.wikimedia.org/T256881) (owner: 10Legoktm) [09:19:10] 10Operations, 10Analytics, 10Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10ema) @fgiunchedi: the `puppetmaster` module still has some ganglia-related things such as `prometheus-ganglia-gen`. Is that still needed? [09:19:21] (03CR) 10Jbond: [C: 03+1] hiera: delete yaml files for non-existing hosts [puppet] - 10https://gerrit.wikimedia.org/r/608955 (owner: 10Dzahn) [09:22:41] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608955 (owner: 10Dzahn) [09:22:52] 10Operations, 10Traffic: Certain links being rejected by caching if opened in Internet Explorer with a HTTP 400 error - https://phabricator.wikimedia.org/T256302 (10ema) p:05Triage→03Medium [09:23:32] (03PS9) 10JMeybohm: chartmuseum: Add initial module, profile and role [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) [09:24:21] (03CR) 10JMeybohm: chartmuseum: Add initial module, profile and role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [09:25:07] (03CR) 10Aaron Schulz: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [09:25:09] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/609104 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [09:26:27] 10Operations, 10Traffic: Certain links being rejected by caching if opened in Internet Explorer with a HTTP 400 error - https://phabricator.wikimedia.org/T256302 (10ema) Interesting, I'd be inclined to think that the issue here cannot be simply the user agent or we would know. 
:-) - Is the problem reproducibl... [09:27:51] (03CR) 10Giuseppe Lavagetto: [C: 03+1] chartmuseum: Add initial module, profile and role [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [09:28:46] !log imported chartmuseum_0.12.0-2 to buster-wikimedia - T253843 [09:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:51] T253843: Move helm chart repository out of git - https://phabricator.wikimedia.org/T253843 [09:29:04] (03CR) 10Gehel: [C: 03+1] "LGTM in principle, obviously the failing tests need to be fixed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: 10Ryan Kemper) [09:29:35] (03CR) 10JMeybohm: [C: 03+2] chartmuseum: Add initial module, profile and role [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [09:31:25] (03Abandoned) 10Ayounsi: Add PIM stanza for CR devices [homer/public] - 10https://gerrit.wikimedia.org/r/549689 (owner: 10Ayounsi) [09:32:23] (03PS1) 10Gehel: snapshots: enable dumps of structured data from commons [puppet] - 10https://gerrit.wikimedia.org/r/609114 (https://phabricator.wikimedia.org/T221917) [09:33:05] PROBLEM - MariaDB Replica SQL: matomo on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:33:15] PROBLEM - MariaDB Replica IO: analytics-meta on db1108 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:33:17] PROBLEM - MariaDB Replica Lag: matomo on db1108 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:33:27] (03CR) 10ArielGlenn: [C: 03+1] "At last." 
[puppet] - 10https://gerrit.wikimedia.org/r/609114 (https://phabricator.wikimedia.org/T221917) (owner: 10Gehel) [09:33:45] PROBLEM - MariaDB read only matomo on db1108 is CRITICAL: Could not connect to localhost:3321 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [09:33:54] (03CR) 10Gehel: [C: 03+2] snapshots: enable dumps of structured data from commons [puppet] - 10https://gerrit.wikimedia.org/r/609114 (https://phabricator.wikimedia.org/T221917) (owner: 10Gehel) [09:34:11] PROBLEM - MariaDB Replica SQL: analytics-meta on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:34:17] PROBLEM - MariaDB Replica IO: matomo on db1108 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:34:46] I literally just renewed downtime [09:34:56] so lucky :D [09:35:20] !log advertise codfw prefixes from eqord [09:35:21] sorry for the spam, wip [09:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:44] (03PS3) 10Jbond: icinga: move server wide config to httpd::conf define [puppet] - 10https://gerrit.wikimedia.org/r/608417 [09:36:00] (03CR) 10Jbond: "update thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608417 (owner: 10Jbond) [09:42:56] (03PS4) 10Jbond: icinga: move server wide config to httpd::conf define [puppet] - 10https://gerrit.wikimedia.org/r/608417 [09:42:58] (03PS12) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [09:43:15] (03CR) 10Jbond: "> Patch Set 11: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [09:43:20] (03PS1) 10Ayounsi: Advertize codfw space from eqord [homer/public] - 
10https://gerrit.wikimedia.org/r/609120 [09:44:03] (03CR) 10Ayounsi: [C: 03+2] Advertize codfw space from eqord [homer/public] - 10https://gerrit.wikimedia.org/r/609120 (owner: 10Ayounsi) [09:44:29] (03Merged) 10jenkins-bot: Advertize codfw space from eqord [homer/public] - 10https://gerrit.wikimedia.org/r/609120 (owner: 10Ayounsi) [09:48:17] (03CR) 10Jbond: [C: 03+2] scap::target: make this resource ensurable [puppet] - 10https://gerrit.wikimedia.org/r/608811 (owner: 10Jbond) [09:48:57] (03CR) 10Jbond: [C: 03+2] security::access::config: Add types to define [puppet] - 10https://gerrit.wikimedia.org/r/608859 (owner: 10Jbond) [09:49:16] (03PS6) 10Kormat: mariadb: Add role and section profiles to remaining mariadb roles. [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) [09:50:21] (03Abandoned) 10Ayounsi: Netbox, make non sensitive models public [puppet] - 10https://gerrit.wikimedia.org/r/526686 (https://phabricator.wikimedia.org/T226331) (owner: 10Ayounsi) [09:50:40] (03CR) 10Kormat: "> Everything here seems right, but I think there is missing roles? (dbstores)" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [09:51:00] (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [09:52:05] (03Abandoned) 10Ayounsi: Use arping to detect duplicated IPs [puppet] - 10https://gerrit.wikimedia.org/r/435797 (https://phabricator.wikimedia.org/T189522) (owner: 10Ayounsi) [09:58:19] (03PS7) 10Kormat: mariadb: Add role and section profiles to remaining mariadb roles. 
[puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) [09:58:39] (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [09:59:15] 10Operations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Prevention): Juniper HA audit - https://phabricator.wikimedia.org/T191667 (10ayounsi) [10:00:37] 10Operations, 10Analytics-Radar, 10Wikimedia-Logstash, 10observability, 10Performance-Team (Radar): Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10fgiunchedi) We are on the Kafka pipeline for MW logs that were sent to logstash ov... [10:01:52] 10Operations, 10Analytics, 10Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10fgiunchedi) >>! In T253555#6273943, @ema wrote: > @fgiunchedi: the `puppetmaster` module still has some ganglia-related things such as `prometheus-ganglia-gen`. Is that still needed?... [10:03:01] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (10fgiunchedi) 05Open→03Resolved This is done, thanks @herron for putting the new VMs in service [10:03:04] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (10fgiunchedi) [10:03:50] (03CR) 10Jcrespo: "@Jbond, you detected many existing issue with current puppetization. However, those are not in scope of this patch. We are slowly working " [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [10:04:19] jouncebot: now [10:04:19] For the next 20 hour(s) and 55 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200702T0700) [10:04:53] oh ... :P [10:05:33] * addshore missed the fact that there are no backport windows today [10:06:40] (03CR) 10Jbond: "> Patch Set 7:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [10:07:46] (03PS1) 10JMeybohm: secret: add dummy key for helm-charts (chartmuseum) [labs/private] - 10https://gerrit.wikimedia.org/r/609121 (https://phabricator.wikimedia.org/T253843) [10:08:55] (03Abandoned) 10Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond) [10:09:00] 10Operations, 10ops-codfw, 10observability: ps1-c3-codfw icinga checks UNKNOWN - https://phabricator.wikimedia.org/T256953 (10fgiunchedi) [10:09:18] (03CR) 10Jbond: profile::mariadb::misc: create generic profile for misc classes (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608895 (owner: 10Jbond) [10:10:08] (03PS1) 10JMeybohm: Add certificate for helm-charts (chartmuseum) [puppet] - 10https://gerrit.wikimedia.org/r/609122 (https://phabricator.wikimedia.org/T253843) [10:11:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/608881 (owner: 10Muehlenhoff) [10:11:23] db issues on phab, did anyone see them? [10:11:59] it seems to be working fine for me [10:13:17] there is a spike of activity, but db seems ok [10:13:28] maybe network issue between client and server? [10:14:16] strange [10:16:09] (03PS3) 10Jbond: profile::waf::apache::administrative: remove waf config [puppet] - 10https://gerrit.wikimedia.org/r/608806 (https://phabricator.wikimedia.org/T253632) [10:17:11] jynus: yes I see them [10:18:14] no icinga alert? [10:18:29] and now it seems back to normal? [10:21:39] (03CR) 10Jcrespo: "I am not sure this is the right direction. 
There are things that should be done for sure:" [puppet] - 10https://gerrit.wikimedia.org/r/608895 (owner: 10Jbond) [10:22:24] (03PS4) 10Jbond: profile::waf::apache::administrative: remove waf config [puppet] - 10https://gerrit.wikimedia.org/r/608806 (https://phabricator.wikimedia.org/T253632) [10:22:26] (03PS2) 10Jbond: block_abuse_nets: enable block abuse nets on misc sites [puppet] - 10https://gerrit.wikimedia.org/r/608807 (https://phabricator.wikimedia.org/T253632) [10:23:01] (03Abandoned) 10Jbond: mariadb::profile::firewall: use the profile::mariadb::ferm type [puppet] - 10https://gerrit.wikimedia.org/r/608631 (owner: 10Jbond) [10:23:17] (03Abandoned) 10Jbond: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608596 (owner: 10Jbond) [10:24:45] (03PS8) 10Kormat: mariadb: Add role and section profiles to remaining mariadb roles. [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) [10:25:26] the proxy says server is up [10:25:31] and never went down [10:26:07] (03CR) 10Kormat: "check experimental" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [10:27:08] https://grafana.wikimedia.org/d/000000377/host-overview?panelId=31&fullscreen&orgId=1&refresh=5m&var-server=dbproxy1016&var-datasource=thanos&var-cluster=mysql&from=1593599222176&to=1593685622176 [10:27:24] spike in TCP retransmits [10:28:33] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608417 (owner: 10Jbond) [10:29:16] (03CR) 10Jbond: mariadb: Add role and section profiles to remaining mariadb roles. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [10:32:49] 10Operations, 10observability: Port Prometheus dashboards to Thanos - https://phabricator.wikimedia.org/T256954 (10fgiunchedi) [10:34:49] !log move "cluster overview" dashboard to Thanos - T256954 [10:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:54] T256954: Port Prometheus dashboards to Thanos - https://phabricator.wikimedia.org/T256954 [10:39:45] (03CR) 10Jbond: "Note this is not meant to be a complete implementations it is instead a starting point" [puppet] - 10https://gerrit.wikimedia.org/r/608895 (owner: 10Jbond) [10:42:17] (03PS5) 10Jbond: icinga: move server wide config to httpd::conf define [puppet] - 10https://gerrit.wikimedia.org/r/608417 [10:43:16] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608417 (owner: 10Jbond) [10:45:06] (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [10:45:18] (03CR) 10Jcrespo: "May I ask for a body of what the patch does and why? Right now there is no explanation, just a list of CI-related hosts. 
https://www.media" [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [10:45:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, PCC is also sane: https://puppet-compiler.wmflabs.org/compiler1003/23626/" [puppet] - 10https://gerrit.wikimedia.org/r/608877 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [10:48:30] (03CR) 10Kormat: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [10:57:17] (03CR) 10Jcrespo: "As far as I can see, no misc role could use this profile:" [puppet] - 10https://gerrit.wikimedia.org/r/608895 (owner: 10Jbond) [11:02:42] RECOVERY - snapshot of s1 in codfw on db2093 is OK: Last snapshot for s1 at codfw (db2097.codfw.wmnet:3311) taken on 2020-07-02 09:32:02 (1008 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [11:02:44] RECOVERY - snapshot of s1 in eqiad on db2093 is OK: Last snapshot for s1 at eqiad (db1139.eqiad.wmnet:3311) taken on 2020-07-02 09:26:46 (983 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [11:03:07] (03CR) 10Jcrespo: "> Can you point me to this discussion" [puppet] - 10https://gerrit.wikimedia.org/r/608895 (owner: 10Jbond) [11:04:22] PROBLEM - docker-registry LVS codfw on docker-registry.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 354 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Docker-registry-runbook [11:04:32] PROBLEM - MariaDB Replica Lag: s4 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1128.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [11:04:48] PROBLEM - Docker registry HTTPS interface on registry2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2001.codfw.wmnet:443/v2/wikimedia-stretch/manifests/latest - 354 bytes in 0.153 second response time 
https://wikitech.wikimedia.org/wiki/Docker [11:04:56] I am taking care of db1145 [11:05:01] PROBLEM - LVS docker-registry codfw port 443/tcp - docker registry service IPv4 #page on docker-registry.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 354 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:05:10] PROBLEM - Docker registry HTTPS interface on registry2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2002.codfw.wmnet:443/v2/wikimedia-stretch/manifests/latest - 354 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Docker [11:05:29] ^ akosiaris [11:05:34] uh oh [11:05:36] PROBLEM - Docker registry health on registry2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 287 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [11:05:42] expected due to changes? [11:05:49] about to get food but ping me if I can help [11:05:53] * akosiaris around [11:05:55] jayme: ^ [11:05:58] PROBLEM - Docker registry health on registry2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 287 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Docker [11:05:58] * apergos looks in [11:05:59] here [11:06:20] looks like only codfw for now ? [11:06:31] <_joe_> yes but it's the active one [11:06:38] * akosiaris looking [11:07:19] <_joe_> akosiaris: monitoring is going to /v2/wikimedia-stretch/manifests/latest [11:07:21] <_joe_> lol [11:07:34] /o\ [11:07:38] what does that mean, is the a new version? [11:07:39] hrhr [11:07:43] *there [11:07:44] <_joe_> ok, back to lunch [11:07:51] <_joe_> can someone else fix that? [11:08:02] <_joe_> the service is up and running [11:08:12] yeah, but I didn't get where the error is? [11:08:12] why do you say it's a monitoring problem? 
[11:09:22] <_joe_> Didn't we remove it? [11:09:35] stretch? no [11:09:45] neither did we remove jessie yet [11:09:46] jessie was removed AFAIK [11:09:49] ah [11:09:54] we just don't update the image anymore [11:09:57] <_joe_> Oh sorry then [11:10:03] we will remove wikimedia-jessie in 1 month from now [11:11:27] not sure yet what's going on [11:11:37] HTTP/1.1 503 Service Unavailable {"errors":[{"code":"UNAVAILABLE","message":"service unavailable","detail":"health check failed: please see /debug/health"}]} [11:12:28] :"swift: Get https://swift.svc.codfw.wmnet/auth/v1.0: x509: certificate has expired or is not yet valid"} [11:12:31] but /debug/health gives me a 404 [11:12:32] 7o\ [11:12:42] oh, there we go [11:12:53] _joe_: your plan failed twice already :P [11:12:56] ooof, checking [11:13:16] both docker "products" unable to reload CA at runtime... [11:13:30] <_joe_> so it's the registry? [11:13:40] jayme: so, should I just try to restart docker-registry? [11:13:44] 1 by 1 on the 2 hosts? [11:13:55] <_joe_> doing so on 2001 [11:13:58] <_joe_> it should be harmless [11:14:07] (03PS1) 10Ema: prometheus: remove ganglia leftovers [puppet] - 10https://gerrit.wikimedia.org/r/609130 (https://phabricator.wikimedia.org/T253555) [11:14:08] ok, go for it [11:14:12] why did this trigger now and not on Monday, was there some patch which triggered a service restart or so? 
[11:14:15] <_joe_> !log restarting docker-registry on registry2001 [11:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:22] btw the error log for nginx has a lot of whines for GET / from icinga1001 [11:14:40] RECOVERY - Docker registry health on registry2001 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [11:14:46] moritzm: doubtful Active: active (running) since Fri 2019-07-12 11:03:03 UTC; 11 months 21 days ago [11:15:13] <_joe_> done [11:15:23] "GET /v2/wikimedia-stretch/manifests/latest HTTP/1.1" 200 2071 "" "check_http/v2.2 (monitoring-plugins 2.2)" [11:15:24] <_joe_> on 2001 now we get a 200 for monitoring [11:15:25] cool [11:15:29] (03PS1) 10Ema: puppetmaster: remove ganglia leftovers [puppet] - 10https://gerrit.wikimedia.org/r/609131 (https://phabricator.wikimedia.org/T253555) [11:15:34] ok doing 2002 then [11:15:39] <_joe_> cool [11:15:44] RECOVERY - Docker registry HTTPS interface on registry2001 is OK: HTTP OK: HTTP/1.1 200 OK - 2567 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Docker [11:15:54] <_joe_> let's check eqiad too [11:15:57] RECOVERY - LVS docker-registry codfw port 443/tcp - docker registry service IPv4 #page on docker-registry.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 292 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:16:00] why hasn't docker-registry100X complained? 
[11:16:05] yeah, checking eqiad now [11:16:06] RECOVERY - Docker registry HTTPS interface on registry2002 is OK: HTTP OK: HTTP/1.1 200 OK - 2567 bytes in 0.231 second response time https://wikitech.wikimedia.org/wiki/Docker [11:16:12] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608417 (owner: 10Jbond) [11:16:22] !log restart docker-registry on registry2002 for CA refresh [11:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:41] <_joe_> now, it's still returning 200 [11:16:45] <_joe_> for some reason [11:16:52] <_joe_> but I'd restart it there too [11:16:54] RECOVERY - Docker registry health on registry2002 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [11:16:57] yeah, doing so now [11:17:06] maybe it's some long lived connection or whatever [11:17:07] ahh there's the 200 in the logs at last (for registry2002) [11:17:10] RECOVERY - docker-registry LVS codfw on docker-registry.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 292 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Docker-registry-runbook [11:17:12] <_joe_> ok, finally back to lunch for real [11:17:20] enjoy! 
[11:17:26] +1 [11:18:10] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: remove ganglia leftovers [puppet] - 10https://gerrit.wikimedia.org/r/609130 (https://phabricator.wikimedia.org/T253555) (owner: 10Ema) [11:18:20] (03CR) 10Jcrespo: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [11:18:20] !log preactively restart docker-registry on registry1001, registry1002 to force CA refresh [11:18:21] 200 for / as well [11:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:39] (03CR) 10Filippo Giunchedi: [C: 03+1] puppetmaster: remove ganglia leftovers [puppet] - 10https://gerrit.wikimedia.org/r/609131 (https://phabricator.wikimedia.org/T253555) (owner: 10Ema) [11:19:02] PROBLEM - MariaDB Replica Lag: s8 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1087.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [11:19:38] ok I think I can go to lunch now too. Ping me if any issues arise [11:21:05] 10Operations, 10CAS-SSO, 10User-jbond: Enable CAS authentications on librenms - https://phabricator.wikimedia.org/T256958 (10jbond) p:05Triage→03Medium [11:21:16] (03PS1) 10Privacybatm: Transferer.py: Produce correct error when users try with a wrong host. 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) [11:22:47] (03CR) 10Ema: [C: 03+2] purged: alert if local backlog grows past the given limits [puppet] - 10https://gerrit.wikimedia.org/r/608564 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [11:27:35] (03CR) 10Ema: [C: 03+1] Remove obsolete pinning for liblwloc [puppet] - 10https://gerrit.wikimedia.org/r/609100 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [11:28:23] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10ssingh) a:03ssingh [11:29:14] (03PS8) 10Jbond: WIP: profile::mariadb::misc: create generic profile for misc classes [puppet] - 10https://gerrit.wikimedia.org/r/608895 [11:29:18] (03CR) 10Vgutierrez: [C: 03+1] Remove obsolete pinning for liblwloc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609100 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [11:30:42] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10ssingh) Hi @Edtadros: Once you complete (read and sign) the L3 document, I will resolve this task (today). Thanks! [11:31:01] (03CR) 10Jbond: "Just to clarify this is not expected to be a complete profile which coveres all uses cases. instead it is meant as a starting point to wo" [puppet] - 10https://gerrit.wikimedia.org/r/608895 (owner: 10Jbond) [11:31:26] (03PS2) 10Muehlenhoff: Remove obsolete pinning for libhwloc [puppet] - 10https://gerrit.wikimedia.org/r/609100 (https://phabricator.wikimedia.org/T256877) [11:31:53] 10Operations, 10Traffic, 10Patch-For-Review, 10Sustainability (Incident Prevention): monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10ema) [11:32:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! 
I'll repeat my practical live test when this is merged." [puppet] - 10https://gerrit.wikimedia.org/r/608878 (https://phabricator.wikimedia.org/T256536) (owner: 10Jbond) [11:33:53] !log rolling restart of codfw load balancers to catch up on kernel upgrades [11:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:35:42] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:30] 10Operations, 10Traffic, 10Patch-For-Review, 10Sustainability (Incident Prevention): monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10ema) @cdanis all done except for `rdkafka_consumer_topics_partitions_consumer_lag`, there's silence on grafana.wikimedia.org/explore when look... [11:39:19] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:39:25] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:40:35] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 53, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:40:39] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 52, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:40:43] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:40:43] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:42:52] (03CR) 10Jbond: [C: 03+2] apero_cas: add abbility to configure per service properties [puppet] - 10https://gerrit.wikimedia.org/r/608877 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [11:43:10] (03CR) 10Jbond: [C: 03+2] role::idp: disable X-frame-options for icinga [puppet] - 10https://gerrit.wikimedia.org/r/608878 (https://phabricator.wikimedia.org/T256536) (owner: 10Jbond) [11:44:16] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Make Graphite httpd site configurable [puppet] - 10https://gerrit.wikimedia.org/r/608881 (owner: 10Muehlenhoff) [11:44:33] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:44:39] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:44:53] (03PS1) 10Ayounsi: Remove nonstop-bridging from switches [homer/public] - 10https://gerrit.wikimedia.org/r/609139 (https://phabricator.wikimedia.org/T191667) [11:48:08] (03CR) 10Ayounsi: "https://www.juniper.net/documentation/en_US/junos/topics/reference/general/nonstop-bridging-system-requirements.html" [homer/public] - 10https://gerrit.wikimedia.org/r/609139 (https://phabricator.wikimedia.org/T191667) (owner: 10Ayounsi) [11:49:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:50:45] 10Operations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Prevention): Juniper HA audit - https://phabricator.wikimedia.org/T191667 (10ayounsi) [11:51:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:51:55] elukey: dbstore1005 s8 replication stopped, but I cannot see why [11:52:10] are you doing maintenance there? [11:52:22] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10ssingh) [11:53:12] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:53:13] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:42] (03PS2) 10Muehlenhoff: Remove IDP defintions for logstash vhosts [puppet] - 10https://gerrit.wikimedia.org/r/607509 (https://phabricator.wikimedia.org/T246998) [11:57:37] (03CR) 10Jbond: "> > > May I ask for a body of what the patch does and why? Right now there is no explanation, just a list of CI-related hosts. 
https://www" [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [12:01:30] (03CR) 10Ema: [C: 03+2] prometheus: remove ganglia leftovers [puppet] - 10https://gerrit.wikimedia.org/r/609130 (https://phabricator.wikimedia.org/T253555) (owner: 10Ema) [12:01:40] (03PS2) 10Ema: prometheus: remove ganglia leftovers [puppet] - 10https://gerrit.wikimedia.org/r/609130 (https://phabricator.wikimedia.org/T253555) [12:02:46] PROBLEM - puppet last run on ganeti5003 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:05:08] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 52, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:06:20] those BGP alerts are my fault [12:06:29] ACKNOWLEDGEMENT - MariaDB Replica Lag: s8 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3869.96 seconds Jcrespo https://phabricator.wikimedia.org/T256966 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [12:08:22] RECOVERY - puppet last run on ganeti5003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:08:22] (03PS3) 10Muehlenhoff: Remove IDP defintions for logstash vhosts [puppet] - 10https://gerrit.wikimedia.org/r/607509 (https://phabricator.wikimedia.org/T246998) [12:10:21] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:10:22] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:34] !log pre-configure asw2-b-eqiad<->cloudsw1-c8-eqiad - T251632 [12:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:38] T251632: (Need By: 
2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 [12:13:47] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609131 (https://phabricator.wikimedia.org/T253555) (owner: 10Ema) [12:13:57] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/23631/" [puppet] - 10https://gerrit.wikimedia.org/r/607509 (https://phabricator.wikimedia.org/T246998) (owner: 10Muehlenhoff) [12:15:08] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 53, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:15:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:17:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:17:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10ayounsi) @Jclark-ctr or @Cmjohnson some information are missing in Netbox about the cabling: See https://netbox.wikime... 
[12:18:56] (03PS1) 10Jbond: apereo_cas: fix service properties entry [puppet] - 10https://gerrit.wikimedia.org/r/609142 [12:19:52] (03CR) 10Jbond: [C: 03+2] apereo_cas: fix service properties entry [puppet] - 10https://gerrit.wikimedia.org/r/609142 (owner: 10Jbond) [12:20:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:21:26] (03CR) 10QChris: "> I guess that patch should have updated the use of {$formatChangeUrl} and pass them through" [puppet] - 10https://gerrit.wikimedia.org/r/608954 (https://phabricator.wikimedia.org/T256729) (owner: 10QChris) [12:24:48] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:26:21] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove obsolete pinning for libhwloc [puppet] - 10https://gerrit.wikimedia.org/r/609100 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff) [12:27:22] !log rolling restart of esams load balancers to catch up on kernel upgrades [12:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:57] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:27:58] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:28:51] idp is timing out for me [12:29:20] 
PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:29:33] kormat: sorry, that's me, should be fixed momentarily [12:29:33] ^^ that's me again [12:29:40] (03PS2) 10Ema: puppetmaster: remove ganglia leftovers [puppet] - 10https://gerrit.wikimedia.org/r/609131 (https://phabricator.wikimedia.org/T253555) [12:29:42] jbond42: ah ok :) [12:31:47] !log rebooting mw1349-mw1369 for kernel security updates [12:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:02] (03CR) 10Ema: [C: 03+2] puppetmaster: remove ganglia leftovers [puppet] - 10https://gerrit.wikimedia.org/r/609131 (https://phabricator.wikimedia.org/T253555) (owner: 10Ema) [12:32:28] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:32:29] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:36] 10Operations, 10vm-requests: Site: eqiad/codwf each 1 VM for helm-charts.wikimedia.org (chartmuseum) - https://phabricator.wikimedia.org/T256970 (10JMeybohm) [12:33:24] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:33:29] kormat: should be back sorry [12:33:47] jbond42: it's working now. no worries, i was mostly commenting to give some context to the alert [12:34:05] (03PS1) 10Jbond: idp: quote bool value [puppet] - 10https://gerrit.wikimedia.org/r/609145 [12:34:14] ahh thanks [12:34:33] (03CR) 10Jbond: [C: 03+2] idp: quote bool value [puppet] - 10https://gerrit.wikimedia.org/r/609145 (owner: 10Jbond) [12:35:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:36:33] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:44] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:36:57] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:18] 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10ema) [12:37:24] 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10ema) 05Open→03Resolved [12:37:36] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: Add mysql_role and section profiles to remaining mariadb roles - https://phabricator.wikimedia.org/T256866 (10Kormat) [12:38:09] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607509 (https://phabricator.wikimedia.org/T246998) (owner: 10Muehlenhoff) [12:38:32] (03PS9) 10Kormat: mariadb: Add role and section profiles to remaining mariadb roles. 
[puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) [12:41:32] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:43:43] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:43:43] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:32] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:45:07] (03PS6) 10Jbond: icinga: move server wide config to httpd::conf define [puppet] - 10https://gerrit.wikimedia.org/r/608417 [12:46:33] (03CR) 10Jbond: [C: 03+2] icinga: move server wide config to httpd::conf define [puppet] - 10https://gerrit.wikimedia.org/r/608417 (owner: 10Jbond) [12:47:12] 10Operations, 10CX-cxserver, 10Citoid, 10Core Platform Team, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (10MSantos) >>! In T218217#6263001, @Physikerwelt wrote: > I do not really understand what needs to be done within mathoid. Mathoid has two... 
[12:47:59] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:48:26] 10Operations, 10CX-cxserver, 10Citoid, 10Core Platform Team, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (10MSantos) [12:51:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:51:28] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [12:52:07] (03CR) 10Gehel: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [12:52:10] (03CR) 10Kormat: "> What I mean is the commit body of this proposed patch lacks a generic explanation on its own. A general explanation of why and what is b" [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [12:53:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:54:32] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10sguebo_WMF) [12:54:44] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 80, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:57:09] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10JanWMF) approved [12:57:32] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:01:10] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:01:37] 10Operations, 10DBA, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) [13:03:30] 10Operations, 10DBA, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) p:05Triage→03Medium [13:04:38] RECOVERY - snapshot of s8 in eqiad on db2093 is OK: Last snapshot for s8 at eqiad (db1116.eqiad.wmnet:3318) taken on 2020-07-02 09:36:48 (1082 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [13:05:45] (03CR) 10Paladox: gerrit: Format short gerrit URLs in phabricator comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608954 (https://phabricator.wikimedia.org/T256729) (owner: 10QChris) [13:06:27] 10Operations, 10DBA, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - 
https://phabricator.wikimedia.org/T256972 (10Kormat) [13:08:04] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10ssingh) a:03ssingh [13:08:16] 10Operations, 10DBA, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) [13:09:08] 10Operations, 10DBA, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) [13:13:44] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 23535 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:22:23] (03CR) 10ZPapierski: [C: 04-1] Handle oauth proxy settings (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [13:22:41] (03PS2) 10Ottomata: Refine: Only filter for allowed domains from external EventLogging data [puppet] - 10https://gerrit.wikimedia.org/r/608443 [13:24:37] (03CR) 10Ottomata: [C: 03+2] Refine: Only filter for allowed domains from external EventLogging data [puppet] - 10https://gerrit.wikimedia.org/r/608443 (owner: 10Ottomata) [13:26:54] (03PS1) 10Giuseppe Lavagetto: restbase: install the service proxy along with the tls termination [puppet] - 10https://gerrit.wikimedia.org/r/609152 (https://phabricator.wikimedia.org/T255133) [13:26:56] (03PS1) 10Giuseppe Lavagetto: wmflib::service: introduce get_url function [puppet] - 10https://gerrit.wikimedia.org/r/609153 [13:26:58] (03PS1) 10Giuseppe Lavagetto: restbase: switch to using get_url [puppet] - 10https://gerrit.wikimedia.org/r/609154 (https://phabricator.wikimedia.org/T255133) [13:27:00] (03PS1) 10Giuseppe Lavagetto: restbase: use the services proxy for everything but parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/609155 
(https://phabricator.wikimedia.org/T255133) [13:27:21] 10Operations, 10DBA, 10User-Kormat: Remove unused parameters from profile::mariadb::monitor::prometheus - https://phabricator.wikimedia.org/T256879 (10Kormat) p:05Triage→03Medium [13:27:50] 10Operations, 10DBA, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) [13:27:55] 10Operations, 10DBA, 10User-Kormat: Remove unused parameters from profile::mariadb::monitor::prometheus - https://phabricator.wikimedia.org/T256879 (10Kormat) [13:28:13] (03CR) 10jerkins-bot: [V: 04-1] restbase: install the service proxy along with the tls termination [puppet] - 10https://gerrit.wikimedia.org/r/609152 (https://phabricator.wikimedia.org/T255133) (owner: 10Giuseppe Lavagetto) [13:28:26] (03CR) 10jerkins-bot: [V: 04-1] restbase: switch to using get_url [puppet] - 10https://gerrit.wikimedia.org/r/609154 (https://phabricator.wikimedia.org/T255133) (owner: 10Giuseppe Lavagetto) [13:28:58] (03CR) 10jerkins-bot: [V: 04-1] wmflib::service: introduce get_url function [puppet] - 10https://gerrit.wikimedia.org/r/609153 (owner: 10Giuseppe Lavagetto) [13:29:15] (03CR) 10jerkins-bot: [V: 04-1] restbase: use the services proxy for everything but parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/609155 (https://phabricator.wikimedia.org/T255133) (owner: 10Giuseppe Lavagetto) [13:37:09] RECOVERY - snapshot of s8 in codfw on db2093 is OK: Last snapshot for s8 at codfw (db2100.codfw.wmnet:3318) taken on 2020-07-02 09:44:16 (1138 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [13:39:00] (03CR) 10Jcrespo: mariadb: Add role and section profiles to remaining mariadb roles. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [13:42:24] (03CR) 10Kormat: mariadb: Add role and section profiles to remaining mariadb roles. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [13:43:49] (03CR) 10QChris: gerrit: Format short gerrit URLs in phabricator comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608954 (https://phabricator.wikimedia.org/T256729) (owner: 10QChris) [13:43:51] (03PS1) 10Ssingh: admin: add Edward Tadros to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/609158 (https://phabricator.wikimedia.org/T256435) [13:45:09] (03CR) 10QChris: "Maybe I'm missing the point here ... but do we want to see those Gerrit URLs with project names" [puppet] - 10https://gerrit.wikimedia.org/r/608954 (https://phabricator.wikimedia.org/T256729) (owner: 10QChris) [13:46:36] (03CR) 10Jcrespo: [C: 03+1] mariadb: Add role and section profiles to remaining mariadb roles. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [13:47:02] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) [13:47:35] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) [13:49:36] (03PS1) 10Elukey: Allow db1108 to replicate data from matomo and meta databases [puppet] - 10https://gerrit.wikimedia.org/r/609160 (https://phabricator.wikimedia.org/T234826) [13:51:47] (03CR) 10Elukey: [C: 03+2] Allow db1108 to replicate data from matomo and meta databases [puppet] - 10https://gerrit.wikimedia.org/r/609160 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [13:52:12] elukey: hi! mariadb@s8 crashed on dbstore1005 (T256966) and probably corrupted its data. i'm going to restore that section from backups to see if that fixes the issue. 
is it ok to stop the instance while that's in progress? replication has already been down for 2h48m [13:52:13] T256966: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 [13:52:39] kormat: yes please go ahead, I saw the task, thanks a lot for the work! [13:52:49] 10Operations, 10observability, 10User-fgiunchedi: Port Prometheus dashboards to Thanos - https://phabricator.wikimedia.org/T256954 (10fgiunchedi) [13:52:50] cheers :) [13:53:48] godog: jynus: [13:53:50] sorry [13:55:11] jbond42: is this your AAAS? (apologies as a service) [13:56:08] :D [13:56:34] haha! great addition to /go [13:58:54] 10Operations, 10Traffic, 10Sustainability (Incident Prevention): monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10CDanis) 05Open→03Resolved a:03CDanis I was thinking about this metric: `kafka_burrow_partition_lag{topic=~".*\\.resource-purge",group=~"cp.*"}` -- grafana exp... [13:58:57] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10CDanis) [13:59:07] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review, and 2 others: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10CDanis) [14:00:03] (03PS1) 10Elukey: Allow db1108's IPv6 address to replicate matomo/meta databases [puppet] - 10https://gerrit.wikimedia.org/r/609161 (https://phabricator.wikimedia.org/T234826) [14:03:14] !log stopped mariadb@s8 on dbstore1005 for data restoration T256966 [14:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:21] T256966: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 [14:06:31] PROBLEM - mysqld processes on dbstore1005 is CRITICAL: PROCS CRITICAL: 3 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:06:37] PROBLEM - MariaDB read only s8 on dbstore1005 is CRITICAL: Could not connect 
to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:06:42] ^ that's me [14:07:16] (03PS2) 10Filippo Giunchedi: Move to Debian packaging [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/552486 (https://phabricator.wikimedia.org/T217340) [14:07:33] (03CR) 10Elukey: [C: 03+2] Allow db1108's IPv6 address to replicate matomo/meta databases [puppet] - 10https://gerrit.wikimedia.org/r/609161 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [14:17:09] (03PS1) 10JMeybohm: Introduce chartmuseum[12]001 [dns] - 10https://gerrit.wikimedia.org/r/609164 (https://phabricator.wikimedia.org/T253843) [14:17:11] (03PS1) 10JMeybohm: Add helm-charts discovery records [dns] - 10https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843) [14:19:15] (03CR) 10Kormat: [C: 03+2] mariadb: Add role and section profiles to remaining mariadb roles. [puppet] - 10https://gerrit.wikimedia.org/r/608874 (https://phabricator.wikimedia.org/T256866) (owner: 10Kormat) [14:22:58] (03CR) 10Jcrespo: "See comments." 
(032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/609136 (https://phabricator.wikimedia.org/T256951) (owner: 10Privacybatm) [14:24:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/609164 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [14:25:04] (03PS2) 10Giuseppe Lavagetto: restbase: install the service proxy along with the tls termination [puppet] - 10https://gerrit.wikimedia.org/r/609152 (https://phabricator.wikimedia.org/T255133) [14:25:06] (03PS2) 10Giuseppe Lavagetto: wmflib::service: introduce get_url function [puppet] - 10https://gerrit.wikimedia.org/r/609153 [14:25:08] (03PS2) 10Giuseppe Lavagetto: restbase: switch to using get_url [puppet] - 10https://gerrit.wikimedia.org/r/609154 (https://phabricator.wikimedia.org/T255133) [14:25:10] (03PS2) 10Giuseppe Lavagetto: restbase: use the services proxy for everything but parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/609155 (https://phabricator.wikimedia.org/T255133) [14:27:27] PROBLEM - MariaDB Replica SQL: matomo on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [14:27:29] PROBLEM - MariaDB Replica IO: matomo on db1108 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [14:27:45] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609158 (https://phabricator.wikimedia.org/T256435) (owner: 10Ssingh) [14:27:57] PROBLEM - MariaDB Replica Lag: matomo on db1108 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [14:28:23] PROBLEM - Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:23] PROBLEM - MariaDB Replica Lag: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [14:28:39] PROBLEM - MariaDB read only matomo on db1108 is CRITICAL: Could not connect to localhost:3321 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:29:08] (03CR) 10Jcrespo: [C: 03+2] transferpy: Use logging package instead of print statements [software/transferpy] - 10https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) (owner: 10Privacybatm) [14:29:41] (03Merged) 10jenkins-bot: transferpy: Use logging package instead of print statements [software/transferpy] - 10https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) (owner: 10Privacybatm) [14:29:44] sorryyy for the spam [14:32:43] (03PS3) 10Giuseppe Lavagetto: wmflib::service: introduce get_url function [puppet] - 10https://gerrit.wikimedia.org/r/609153 [14:32:45] (03PS3) 10Giuseppe Lavagetto: restbase: switch to using get_url [puppet] - 10https://gerrit.wikimedia.org/r/609154 (https://phabricator.wikimedia.org/T255133) [14:32:47] (03PS3) 10Giuseppe Lavagetto: restbase: use the services proxy for everything but parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/609155 (https://phabricator.wikimedia.org/T255133) [14:33:08] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:28] (03PS1) 10Filippo Giunchedi: facilities: update model for ps1-c3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/609174 (https://phabricator.wikimedia.org/T256953) [14:34:50] (03CR) 10jerkins-bot: [V: 
04-1] facilities: update model for ps1-c3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/609174 (https://phabricator.wikimedia.org/T256953) (owner: 10Filippo Giunchedi) [14:35:56] (03PS2) 10Filippo Giunchedi: facilities: update model for ps1-c3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/609174 (https://phabricator.wikimedia.org/T256953) [14:36:12] * godog adds a coin to the "working for the tools and not the other way around" jar [14:36:20] re: arrow alignment -1 above [14:37:12] (03PS3) 10Filippo Giunchedi: facilities: update model for ps1-c3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/609174 (https://phabricator.wikimedia.org/T251570) [14:38:16] (03PS1) 10Ema: cumin: update prometheus alias [puppet] - 10https://gerrit.wikimedia.org/r/609178 (https://phabricator.wikimedia.org/T243057) [14:38:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/609178 (https://phabricator.wikimedia.org/T243057) (owner: 10Ema) [14:39:13] (03CR) 10Filippo Giunchedi: [C: 03+2] facilities: update model for ps1-c3-codfw [puppet] - 10https://gerrit.wikimedia.org/r/609174 (https://phabricator.wikimedia.org/T251570) (owner: 10Filippo Giunchedi) [14:39:55] (03CR) 10Filippo Giunchedi: [C: 03+1] cumin: update prometheus alias [puppet] - 10https://gerrit.wikimedia.org/r/609178 (https://phabricator.wikimedia.org/T243057) (owner: 10Ema) [14:40:01] godog: <3 [14:40:16] \o/ thanks for taking a look [14:42:06] elukey: btw, we now have cumin aliases for db hosts. e.g. `cumin A:db-section-matomo` [14:42:10] !log rebooting mw1370-mw1389 for kernel security updates [14:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:24] (or `A:db-role-master`) [14:44:52] (03PS13) 10Jbond: icinga: switch icinga to use apereo cas for authentication [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) [14:47:33] kormat: nice! 
[14:47:46] RECOVERY - ps1-c3-codfw-infeed-load-tower-B-phase-Z on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-B-phase-Z 177 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:47:50] RECOVERY - ps1-c3-codfw-infeed-load-tower-A-phase-X on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-A-phase-X 168 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:47:54] RECOVERY - ps1-c3-codfw-infeed-load-tower-B-phase-X on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-B-phase-X 142 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:47:54] papaul: ^ [14:48:22] RECOVERY - ps1-c3-codfw-infeed-load-tower-A-phase-Y on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-A-phase-Y 162 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:49:16] godog: cool thanks [14:49:58] RECOVERY - ps1-c3-codfw-infeed-load-tower-A-phase-Z on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-A-phase-Z 151 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:52:08] RECOVERY - ps1-c3-codfw-infeed-load-tower-B-phase-Y on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-B-phase-Y 185 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:52:14] 10Operations, 10Traffic, 10Services (watching), 10Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (10Krinkle) [14:52:22] 10Operations, 10DBA, 10Performance-Team (Radar), 10Services (watching), 10Sustainability (MediaWiki-MultiDC): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672 (10Krinkle) [14:55:44] 10Operations: Please import Jenkins Debian package 2.235.1 in apt.wikimedia.org - https://phabricator.wikimedia.org/T256980 (10hashar) [14:57:25] 
(03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/609164 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [14:59:38] (03PS3) 10Ottomata: Default PYSPARK_PYTHON to exact versioned python executable used on driver. [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/602386 (https://phabricator.wikimedia.org/T229347) [15:00:42] PROBLEM - Host ganeti1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:00:52] (03CR) 10Ottomata: Default PYSPARK_PYTHON to exact versioned python executable used on driver. (031 comment) [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/602386 (https://phabricator.wikimedia.org/T229347) (owner: 10Ottomata) [15:02:10] (03CR) 10DCausse: [C: 04-1] "I worry of being to close to the limit for commons, for enwiki I think that it does not work for eqiad (configured for 3 replicas)" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: 10Ryan Kemper) [15:02:27] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr) @akosiaris. 
finished upgrade host is powering up right now [15:04:56] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 68 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:04:58] RECOVERY - snapshot of s4 in eqiad on db2093 is OK: Last snapshot for s4 at eqiad (db1145.eqiad.wmnet:3314) taken on 2020-07-02 12:38:01 (1222 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:06:12] RECOVERY - Host ganeti1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [15:06:58] (03CR) 10Ssingh: [C: 03+2] wikidough: update firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/608299 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:07:02] RECOVERY - snapshot of s4 in codfw on db2093 is OK: Last snapshot for s4 at codfw (db2099.codfw.wmnet:3314) taken on 2020-07-02 12:45:42 (1243 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:07:45] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:53] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:16] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10Jclark-ctr) [15:09:35] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10Cmjohnson) @fgiunchedi icinga1001 is in rack C8, that is now a 10G rack. Do you still want this server there or can we move to another rack that is 1G only and eventually mig... 
[15:14:22] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10fgiunchedi) >>! In T255072#6274943, @Cmjohnson wrote: > @fgiunchedi icinga1001 is in rack C8, that is now a 10G rack. Do you still want this server there or can we move to an... [15:14:53] msw-a1 and a2 done let me know if you notice any problem in rack a1 and a2 in codfw thanks [15:18:23] (03CR) 10Andrew Bogott: "Could this be split into two or three patches? It's nice to be able to verify that refactor bits are no-ops." [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [15:18:29] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Cmjohnson) @gehel @wiki_willy each relforge has 4 ssds [15:19:06] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:46] (03PS1) 10Ema: varnish: update 19-unparseable-host-header.vtc [puppet] - 10https://gerrit.wikimedia.org/r/609179 (https://phabricator.wikimedia.org/T255015) [15:22:08] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:23:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add certificate for helm-charts (chartmuseum) [puppet] - 10https://gerrit.wikimedia.org/r/609122 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [15:23:53] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:53] (03CR) 10MSantos: charts for push-notification 
service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [15:25:16] (03CR) 10EBernhardson: [C: 03+1] Default PYSPARK_PYTHON to exact versioned python executable used on driver. [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/602386 (https://phabricator.wikimedia.org/T229347) (owner: 10Ottomata) [15:29:50] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr) [15:30:33] (03PS1) 10Muehlenhoff: Create component/systemd241 [puppet] - 10https://gerrit.wikimedia.org/r/609181 (https://phabricator.wikimedia.org/T256877) [15:31:26] (03CR) 10EBernhardson: [C: 03+1] Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [15:33:17] (03PS4) 10Giuseppe Lavagetto: restbase: switch to using get_url [puppet] - 10https://gerrit.wikimedia.org/r/609154 (https://phabricator.wikimedia.org/T255133) [15:33:19] (03PS4) 10Giuseppe Lavagetto: restbase: use the services proxy for everything but parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/609155 (https://phabricator.wikimedia.org/T255133) [15:35:13] (03CR) 10ZPapierski: [C: 04-1] "> Patch Set 8: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [15:35:28] (03CR) 10Jbond: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [15:35:30] (03PS1) 10Rush: peek: update config to reflect changeset 609183 [puppet] - 10https://gerrit.wikimedia.org/r/609184 (https://phabricator.wikimedia.org/T256987) [15:36:11] (03CR) 10Rush: [C: 03+2] peek: update config to reflect changeset 609183 [puppet] - 10https://gerrit.wikimedia.org/r/609184 (https://phabricator.wikimedia.org/T256987) (owner: 10Rush) [15:36:28] (03PS9) 10ZPapierski: 
Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) [15:37:54] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:18] 10Operations, 10LDAP-Access-Requests: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (10Dzahn) Glad it worked, yup :) [15:38:27] (03CR) 10Jcrespo: "> Patch Set 2:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [15:38:47] (03CR) 10ZPapierski: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [15:39:31] (03CR) 10Andrew Bogott: "Maybe I'm misunderstanding with my cursory look. Isn't most of this patch adding hiera defaults and/or changing arg syntax which is unrel" [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [15:40:11] (03PS10) 10ZPapierski: Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) [15:40:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:32] (03CR) 10Privacybatm: "> Patch Set 2:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [15:43:04] (03CR) 10Alexandros Kosiaris: Add recommendation-api helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:44:47] (03CR) 10Hashar: [C: 03+1] "The Phabricator message already contains the repository name:" [puppet] - 
10https://gerrit.wikimedia.org/r/608954 (https://phabricator.wikimedia.org/T256729) (owner: 10QChris) [15:44:50] (03PS11) 10ZPapierski: Handle oauth proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) [15:45:38] (03PS9) 10Jbond: puppetmaster::frontend: add hiera calls and type validation [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) [15:45:40] (03PS1) 10Jbond: puppetmaster::frontend: manage ca_cert.pem [puppet] - 10https://gerrit.wikimedia.org/r/609186 (https://phabricator.wikimedia.org/T256721) [15:45:46] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10ssingh) [15:46:40] (03CR) 10Jbond: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [15:47:08] (03PS1) 10Ahmon Dancy: Remove unused `scap swat` command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609187 (https://phabricator.wikimedia.org/T254787) [15:47:45] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10akosiaris) >>! In T224041#6273176, @jeena wrote: > We've published a cassandra image to do... 
[15:56:18] RECOVERY - snapshot of s2 in eqiad on db2093 is OK: Last snapshot for s2 at eqiad (db1095.eqiad.wmnet:3312) taken on 2020-07-02 14:08:30 (798 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:01:20] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:40] RECOVERY - mysqld processes on dbstore1005 is OK: PROCS OK: 4 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:01:50] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) ganeti1006 looks ok, I am already moving VMs to it. emptying ganeti1007 now. [16:02:56] RECOVERY - MariaDB Replica Lag: s8 on dbstore1005 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [16:03:40] RECOVERY - MariaDB read only s8 on dbstore1005 is OK: Version 10.4.13-MariaDB, Uptime 112s, read_only: True, read_only: True, 15.62 QPS, connection latency: 0.002150s, query latency: 0.000539s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:04:04] (03PS1) 10Rush: peek: update phab project defs to primary name [puppet] - 10https://gerrit.wikimedia.org/r/609199 (https://phabricator.wikimedia.org/T256987) [16:05:33] (03CR) 10Rush: [C: 03+2] peek: update phab project defs to primary name [puppet] - 10https://gerrit.wikimedia.org/r/609199 (https://phabricator.wikimedia.org/T256987) (owner: 10Rush) [16:05:39] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:59] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Gehel) So it looks like there 
is a real issue with the raid config. @ryankemper, can you have a look? [16:06:47] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10Jclark-ctr) Server Asset tag Rack Switch port icinga1001 WMF5405 c6 36 [16:07:09] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [16:15:02] RECOVERY - snapshot of s2 in codfw on db2093 is OK: Last snapshot for s2 at codfw (db2098.codfw.wmnet:3312) taken on 2020-07-02 14:26:30 (782 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:16:13] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-17) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10Cmjohnson) an-test-master1001 Rack A5 Port 30 [16:18:38] msw-a3 and a4 replacement done let me know if you notice any problem thanks [16:28:42] (03CR) 10Jforrester: "I guess once this lands we can rebase Iad91f8fa6b5 for the proper removal of the plugins. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609187 (https://phabricator.wikimedia.org/T254787) (owner: 10Ahmon Dancy) [16:32:42] 10Operations, 10Sustainability, 10codfw-rollout: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Krinkle) [16:35:02] PROBLEM - Host puppetmaster2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:38:06] 10Operations, 10Sustainability, 10codfw-rollout: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Krinkle) Per today's Multi-DC meeting, I'm detaching this from the current workboard. It was our understanding that the messages here are largely and perhaps even exc... 
[16:39:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:40:56] RECOVERY - Host puppetmaster2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.45 ms [16:41:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:44:44] RECOVERY - snapshot of s5 in eqiad on db2093 is OK: Last snapshot for s5 at eqiad (db1145.eqiad.wmnet:3315) taken on 2020-07-02 15:44:40 (675 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:52:29] (03PS1) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - I: Add i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609206 (https://phabricator.wikimedia.org/T251279) [16:53:21] (03CR) 10Cicalese: [C: 04-1] "This cannot be merged until all production branches have the skin and" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609206 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [16:54:10] RECOVERY - snapshot of s5 in codfw on db2093 is OK: Last snapshot for s5 at codfw (db2099.codfw.wmnet:3315) taken on 2020-07-02 15:44:56 (673 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:55:57] (03CR) 10EBernhardson: Scale largest shards to be closer to 30GB (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608965 (https://phabricator.wikimedia.org/T256928) (owner: 10Ryan Kemper) [16:58:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add helm-charts discovery records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [17:00:01] (03PS6) 10ZPapierski: Authenticate with MW oauth 1.0a for WCQS [puppet] - 
10https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498) [17:00:45] (03PS2) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - I: Add i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609206 (https://phabricator.wikimedia.org/T251279) [17:00:47] (03PS1) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - II: Add flag to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609210 (https://phabricator.wikimedia.org/T251279) [17:02:23] (03CR) 10Cicalese: [C: 04-1] "This cannot be merged until all production branches have the skin and" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609210 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [17:06:49] (03PS1) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - III: Install where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609211 (https://phabricator.wikimedia.org/T251279) [17:07:25] (03CR) 10Cicalese: [C: 04-1] "This cannot be merged until all production branches have the skin and" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609211 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [17:11:31] (03PS1) 10Cicalese: Install WikimediaApiPortal/WikimediaApiPortalOAuth - IV: Enable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609215 (https://phabricator.wikimedia.org/T251279) [17:12:20] (03CR) 10Cicalese: [C: 04-1] "This cannot be merged until all production branches have the skin and" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609215 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [17:22:14] 10Operations, 10Analytics, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10faidon) [17:23:34] 10Operations, 10Analytics, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10faidon) So - how do we make progress here? 
Any thoughts on who/how? :) Some of these features could really make a tremendous amount of difference to our network operati... [17:32:38] (03CR) 10Cicalese: [C: 04-1] "This is also blocked on merging https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaApiPortalOAuth/+/597553." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609206 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [17:32:56] (03CR) 10Cicalese: [C: 04-1] "This is also blocked on merging https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaApiPortalOAuth/+/597553." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609210 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [17:33:04] (03CR) 10Cicalese: [C: 04-1] "This is also blocked on merging https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaApiPortalOAuth/+/597553." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609211 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [17:33:09] (03CR) 10Cicalese: [C: 04-1] "This is also blocked on merging https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaApiPortalOAuth/+/597553." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/609215 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [17:36:27] (03CR) 10Bstorm: [C: 03+2] toolforge: Update comment reflecting source of tesseract packages [puppet] - 10https://gerrit.wikimedia.org/r/608966 (https://phabricator.wikimedia.org/T256881) (owner: 10Legoktm) [17:43:37] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10ssingh) [17:45:56] (03CR) 10Dzahn: [C: 03+2] hiera: delete yaml files for non-existing hosts [puppet] - 10https://gerrit.wikimedia.org/r/608955 (owner: 10Dzahn) [17:57:36] (03PS1) 10Ssingh: admin: add Chris Albon to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/609217 (https://phabricator.wikimedia.org/T256412) [18:04:50] (03CR) 10Dzahn: [V: 03+1] "checked LDAP for UID, it matches. checked the key matches what was supplied on ticket, it does." 
[puppet] - 10https://gerrit.wikimedia.org/r/609217 (https://phabricator.wikimedia.org/T256412) (owner: 10Ssingh) [18:15:34] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Dzahn) a:05Gehel→03RKemper [18:17:04] (03CR) 10Dzahn: [C: 03+2] planet: remove broken feeds, update feed URLs [puppet] - 10https://gerrit.wikimedia.org/r/608974 (https://phabricator.wikimedia.org/T168459) (owner: 10Dzahn) [18:22:20] RECOVERY - snapshot of s6 in eqiad on db2093 is OK: Last snapshot for s6 at eqiad (db1139.eqiad.wmnet:3316) taken on 2020-07-02 17:27:05 (502 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [18:29:24] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10ssingh) Patch is ready to merge. @Nuria: This requires your approval for the analytics-privatedata-u... [18:31:38] RECOVERY - snapshot of s6 in codfw on db2093 is OK: Last snapshot for s6 at codfw (db2097.codfw.wmnet:3316) taken on 2020-07-02 17:25:28 (511 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [18:37:53] (03CR) 10Ppchelko: "I believe that was superseded with a shared chart and can be abandoned." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [18:37:59] (03Abandoned) 10Ppchelko: Added new chart for cpjobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [18:47:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:47:13] (03PS7) 10ZPapierski: Authenticate with MW oauth 1.0a for WCQS [puppet] - 10https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498) [18:49:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:09:13] (03PS8) 10ZPapierski: Authenticate with MW oauth 1.0a for WCQS [puppet] - 10https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498) [19:13:10] (03PS9) 10ZPapierski: Authenticate with MW oauth 1.0a for WCQS [puppet] - 10https://gerrit.wikimedia.org/r/608905 (https://phabricator.wikimedia.org/T251498) [19:14:46] 10Operations, 10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091 (10Nemo_bis) 05Open→03Declined For now I'm marking this declined for clarity, given the pro-hotlinking policy has been standing for 12 years unchallenged. Should the Wikimedia Foundation start a radical reconsideration of... [19:24:01] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10ssingh) @Nuria: This requires your approval for the analytics-privatedata-users part. Thank you. 
[19:28:10] (03PS1) 10Ssingh: admin: add Ahmon Dancy to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/609225 (https://phabricator.wikimedia.org/T256770) [19:31:11] (03PS2) 10Cicalese: DO NOT MERGE Remove temporary logging for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T247943) [19:34:17] (03PS1) 10Gergő Tisza: GrowthExperiments: shorten welcome survey retention window [puppet] - 10https://gerrit.wikimedia.org/r/609226 (https://phabricator.wikimedia.org/T252575) [19:37:56] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/609225 (https://phabricator.wikimedia.org/T256770) (owner: 10Ssingh) [19:47:25] (03PS5) 10Dzahn: gerrit: remove all database parameters / support [puppet] - 10https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) [19:49:25] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/23642/" [puppet] - 10https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) (owner: 10Dzahn) [19:52:08] !log gerrit2001 - restarting gerrit after removing db_pass from config [19:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:29] (03PS1) 10Mholloway: Add echo_push_subscription to $private_tables [puppet] - 10https://gerrit.wikimedia.org/r/609228 (https://phabricator.wikimedia.org/T246716) [19:56:50] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [20:01:06] (03PS1) 10Andrew Bogott: dynamic proxy: use flask_sqlalchemy instead of flask.ext.sqlalchemy [puppet] - 10https://gerrit.wikimedia.org/r/609229 [20:02:00] 
(03CR) 10Andrew Bogott: [C: 03+2] dynamic proxy: use flask_sqlalchemy instead of flask.ext.sqlalchemy [puppet] - 10https://gerrit.wikimedia.org/r/609229 (owner: 10Andrew Bogott) [20:05:56] (03PS2) 10Dzahn: gerrit: Drop old version [puppet] - 10https://gerrit.wikimedia.org/r/608931 (owner: 10QChris) [20:10:24] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23643/" [puppet] - 10https://gerrit.wikimedia.org/r/608931 (owner: 10QChris) [20:12:09] (03PS2) 10Dzahn: gerrit: Move new version's homedir to default place [puppet] - 10https://gerrit.wikimedia.org/r/608932 (owner: 10QChris) [20:12:26] 10Operations, 10Graphoid, 10serviceops: Delay spinner showing for graphs for 1s - https://phabricator.wikimedia.org/T256641 (10Krinkle) [20:14:32] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23644/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/608932 (owner: 10QChris) [20:15:50] (03PS2) 10Dzahn: gerrit: Drop removal of javamelody-deps jar [puppet] - 10https://gerrit.wikimedia.org/r/608933 (owner: 10QChris) [20:17:41] (03CR) 10Dzahn: "confirmed file is gone on all 3 prod servers" [puppet] - 10https://gerrit.wikimedia.org/r/608933 (owner: 10QChris) [20:18:16] (03PS2) 10Dzahn: gerrit: Make implicit its templates explicit [puppet] - 10https://gerrit.wikimedia.org/r/608953 (https://phabricator.wikimedia.org/T256729) (owner: 10QChris) [20:19:29] (03CR) 10Dzahn: [C: 03+2] gerrit: Make implicit its templates explicit [puppet] - 10https://gerrit.wikimedia.org/r/608953 (https://phabricator.wikimedia.org/T256729) (owner: 10QChris) [20:20:34] !log reindexing frwikibooks to test https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/604221 [20:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:08] (03PS2) 10Dzahn: gerrit: Format short gerrit URLs in phabricator comments [puppet] - 10https://gerrit.wikimedia.org/r/608954 
(https://phabricator.wikimedia.org/T256729) (owner: 10QChris) [20:23:05] (03CR) 10Dzahn: [C: 03+2] gerrit: Format short gerrit URLs in phabricator comments [puppet] - 10https://gerrit.wikimedia.org/r/608954 (https://phabricator.wikimedia.org/T256729) (owner: 10QChris) [20:25:22] (03CR) 10QChris: [C: 03+1] gerrit: Prevent new line being added under javaOptions [puppet] - 10https://gerrit.wikimedia.org/r/608672 (owner: 10Paladox) [20:28:03] (03CR) 10Dzahn: [C: 03+2] gerrit: Prevent new line being added under javaOptions [puppet] - 10https://gerrit.wikimedia.org/r/608672 (owner: 10Paladox) [20:29:53] (03CR) 10Reedy: Install WikimediaApiPortal/WikimediaApiPortalOAuth - III: Install where enabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609211 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [20:34:28] RECOVERY - snapshot of s7 in codfw on db2093 is OK: Last snapshot for s7 at codfw (db2100.codfw.wmnet:3317) taken on 2020-07-02 17:20:18 (998 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [20:34:39] (03PS1) 10QChris: gerrit: Fix its template name of ITS merged comment [puppet] - 10https://gerrit.wikimedia.org/r/609234 [20:35:44] (03PS2) 10QChris: gerrit: Fix ITS template name of merged commits [puppet] - 10https://gerrit.wikimedia.org/r/609234 [20:35:57] (03PS4) 10Paladox: gerrit: Prevent new line being added under javaOptions [puppet] - 10https://gerrit.wikimedia.org/r/608672 [20:36:37] (03CR) 10Dzahn: [C: 03+2] gerrit: Prevent new line being added under javaOptions [puppet] - 10https://gerrit.wikimedia.org/r/608672 (owner: 10Paladox) [20:38:58] (03PS3) 10Dzahn: gerrit: Fix ITS template name of merged commits [puppet] - 10https://gerrit.wikimedia.org/r/609234 (owner: 10QChris) [20:39:31] (03CR) 10Dzahn: [C: 03+2] gerrit: Fix ITS template name of merged commits [puppet] - 10https://gerrit.wikimedia.org/r/609234 (owner: 10QChris) [20:42:39] Thanks mutante! Fix is live. 
[20:42:50] qchris: :) great [20:42:59] so most of the cleanup is done [20:43:07] looking at another one from paladox [20:43:22] https://gerrit.wikimedia.org/r/c/operations/puppet/+/532391 [20:43:27] paladox: ^ [20:43:40] * paladox can rebase [20:43:50] qchris: ^ you said it's cosmetics, but now is the merge of the cosmetics i guess :) [20:44:18] (03PS6) 10Paladox: Revert "Gerrit: Set base url for commitlink" [puppet] - 10https://gerrit.wikimedia.org/r/532391 [20:44:22] RECOVERY - snapshot of s7 in eqiad on db2093 is OK: Last snapshot for s7 at eqiad (db1116.eqiad.wmnet:3317) taken on 2020-07-02 17:06:30 (976 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [20:44:37] (03PS7) 10Paladox: Revert "Gerrit: Set base url for commitlink" [puppet] - 10https://gerrit.wikimedia.org/r/532391 [20:44:52] (03CR) 10Paladox: Revert "Gerrit: Set base url for commitlink" [puppet] - 10https://gerrit.wikimedia.org/r/532391 (owner: 10Paladox) [20:44:56] 1sec. Still verifying the fix of the its-phabricator issue. [20:45:31] actually [20:45:32] yep, no rush. grabbing coffee [20:45:35] need to remove the # [20:45:39] since that's gwtui [20:45:49] aha, paladox :) it's an old change [20:45:54] i like the cleanup though [20:46:04] for my own incoming queue :) [20:47:57] (03CR) 10Dzahn: [C: 04-2] "We solved it in cloud without needing an LDAP server. This can be abandoned i think." [puppet] - 10https://gerrit.wikimedia.org/r/539211 (owner: 10Paladox) [20:48:29] (03PS8) 10Paladox: Revert "Gerrit: Set base url for commitlink" [puppet] - 10https://gerrit.wikimedia.org/r/532391 [20:48:39] mutante [20:48:47] tested the above quickly and works now [20:49:42] Are you sure? Let me double-check. [20:49:43] paladox: thanks! 
and meanwhile i commented on 539211 [20:50:00] qchris yup [20:50:07] (03CR) 10Cicalese: [C: 04-1] Install WikimediaApiPortal/WikimediaApiPortalOAuth - III: Install where enabled (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609211 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [20:50:11] qchris generated http://gerrit.test/r/q/I8b4b943a1883548d8616d8f455c7389c5b5c1a84 [20:50:11] 606387 needs manual rebase. taking that [20:50:49] (03CR) 10Paladox: "Can be abandoned." [puppet] - 10https://gerrit.wikimedia.org/r/606387 (https://phabricator.wikimedia.org/T255715) (owner: 10Jcrespo) [20:50:51] (03CR) 10QChris: [C: 04-1] Revert "Gerrit: Set base url for commitlink" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/532391 (owner: 10Paladox) [20:52:17] The issue is that in the first one, it's `link=...` but in the second one it's `html = ...` so the href has no chance of gerrit prefixed by a `/r` [20:53:01] (03CR) 10Dzahn: "yes, mostly true because of https://gerrit.wikimedia.org/r/c/operations/puppet/+/606549 but it still tells me i need to cleanup in private" [puppet] - 10https://gerrit.wikimedia.org/r/606387 (https://phabricator.wikimedia.org/T255715) (owner: 10Jcrespo) [20:53:24] qchris did you set a base url? [20:53:32] as it works for me [20:53:59] Does foo respect a base url? [20:54:03] (03CR) 10Reedy: Install WikimediaApiPortal/WikimediaApiPortalOAuth - III: Install where enabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609211 (https://phabricator.wikimedia.org/T251279) (owner: 10Cicalese) [20:54:03] How? 
[20:54:23] qchris see https://gerrit-review.googlesource.com/c/gerrit/+/234694
[20:54:31] polygerrit adds the base url
[20:55:10] (CR) Dzahn: "DBAs added to CC: just wanted to let you know gerrit is definitely not using mysql anymore now" [puppet] - https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) (owner: Dzahn)
[20:55:30] https://github.com/GerritCodeReview/gerrit/blob/master/polygerrit-ui/app/elements/shared/gr-linked-text/link-text-parser.js#L174 is the newer code
[20:55:33] Ha! Learnt something :-) Thanks.
[20:55:54] and for html https://github.com/GerritCodeReview/gerrit/blob/master/polygerrit-ui/app/elements/shared/gr-linked-text/link-text-parser.js#L196
[20:57:06] (CR) QChris: Revert "Gerrit: Set base url for commitlink" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/532391 (owner: Paladox)
[21:00:49] (PS2) Dzahn: Revert "gerrit: add parameter for db_name, let gerrit1002 use test db" [puppet] - https://gerrit.wikimedia.org/r/606387 (https://phabricator.wikimedia.org/T255715) (owner: Jcrespo)
[21:02:05] (CR) jerkins-bot: [V: -1] Revert "gerrit: add parameter for db_name, let gerrit1002 use test db" [puppet] - https://gerrit.wikimedia.org/r/606387 (https://phabricator.wikimedia.org/T255715) (owner: Jcrespo)
[21:02:08] (CR) Cicalese: [C: -1] Install WikimediaApiPortal/WikimediaApiPortalOAuth - III: Install where enabled (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/609211 (https://phabricator.wikimedia.org/T251279) (owner: Cicalese)
[21:07:16] (CR) Dzahn: [C: -1] "manually rebased and these conflicts are left. confirmed this is really already done" [puppet] - https://gerrit.wikimedia.org/r/606387 (https://phabricator.wikimedia.org/T255715) (owner: Jcrespo)
[21:07:34] (Abandoned) Dzahn: Revert "gerrit: add parameter for db_name, let gerrit1002 use test db" [puppet] - https://gerrit.wikimedia.org/r/606387 (https://phabricator.wikimedia.org/T255715) (owner: Jcrespo)
[21:10:24] (CR) Dzahn: [C: +2] Revert "Gerrit: Set base url for commitlink" [puppet] - https://gerrit.wikimedia.org/r/532391 (owner: Paladox)
[21:11:40] (CR) QChris: "> Patch Set 4:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/532391 (owner: Paladox)
[21:11:45] (CR) Dzahn: [C: +2] Revert "Gerrit: Set base url for commitlink" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/532391 (owner: Paladox)
[21:13:22] paladox: qchris: deployed to 1002. please test on gerrit-test. puppet disabled on prod.
[21:13:32] Will do.
[21:13:36] testing
[21:13:51] mutante did you restart gerrit?
[21:13:56] since it still has the #
[21:14:01] Still has hashmark for me.
[21:14:01] not yet, heh
[21:14:25] !log gerrit1002 restarted gerrit
[21:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:51] mutante works!
[21:14:58] :)
[21:15:06] Works for me too.
[21:15:18] re-enabling puppet
[21:16:54] paladox: https://gerrit.wikimedia.org/r/c/operations/puppet/+/539211 not needed anymore, we found another solution. ?
[21:17:16] i mean i know we did, just if you agree we can abandon
[21:17:16] yeh, i think we're going with the "dev account" config
[21:17:23] (Abandoned) Paladox: Gerrit: Allow configuring accountPattern [puppet] - https://gerrit.wikimedia.org/r/539211 (owner: Paladox)
[21:17:23] yes, agree. that
[21:17:46] ok, so do you guys see anything else for now?
[21:17:55] (CR) QChris: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: Paladox)
[21:18:26] i need to restart prod gerrit service then
[21:18:37] combining a few minor changes
[21:20:34] remove additivity from log4j config.. it seems to make sense to me but other review would be welcome
[21:20:45] that is https://gerrit.wikimedia.org/r/c/operations/puppet/+/508657/6/modules/gerrit/templates/log4j.xml.erb
[21:21:53] https://veerasundar.com/blog/2009/08/log4j-tutorial-additivity-what-and-why/
[21:22:13] hmm "git fetch "ssh://paladox@gerrit.wikimedia.org:29418/operations/puppet" refs/changes/65/556265/10 && git cherry-pick FETCH_HEAD" is just very slowwwww.
[21:22:17] so it should avoid the redundancy
[21:23:11] mutante: My gist: I'd not remove it. Let me chime in on the change.
[21:23:27] Oh. I did already :-D
[21:24:04] paladox: do you see an actual problem with the duplicate logs ?
[21:24:12] no
[21:24:24] well then let's keep it as it is i guess
[21:24:43] (CR) QChris: [C: -1] "I did not vote last time, so let me do it now." [puppet] - https://gerrit.wikimedia.org/r/508657 (owner: Paladox)
[21:25:31] !log gerrit2001 - restarted gerrit
[21:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:20] RECOVERY - snapshot of x1 in eqiad on db2093 is OK: Last snapshot for x1 at eqiad (db1102.eqiad.wmnet:3320) taken on 2020-07-02 20:39:05 (182 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[21:26:45] (CR) Dzahn: [C: -1] "we don't have an actual issue to solve or a ticket linked to it and it's been here for a while and we are still unsure, so let's just not " [puppet] - https://gerrit.wikimedia.org/r/508657 (owner: Paladox)
[21:27:00] (PS11) Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266)
[21:27:27] (CR) QChris: [C: -1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/509542 (owner: Paladox)
[21:29:15] cleaning up db config stuff for gerrit in private repo, brb
[21:30:11] (PS14) Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266)
[21:30:14] (CR) QChris: "I haven't yet found time to properly test this (and the ecdsa follow-up change)." [puppet] - https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: Paladox)
[21:30:47] Operations, ops-codfw, netops: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (Papaul)
[21:31:45] (PS15) Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266)
[21:32:08] !log gerrit - deleted gerrit db_pass from prod private repo, running puppet
[21:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:38] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[21:33:45] (Abandoned) Paladox: Gerrit: Convert log4j.xml to log4j2.xml [puppet] - https://gerrit.wikimedia.org/r/509542 (owner: Paladox)
[21:33:55] there is the expected $gerrit_db_pass in 2 separate places , passwords module and private hiera
[21:34:12] and then there is also a $gerrit_pass without "db" in it
[21:36:31] removed the second gerrit_db_pass, not touching gerrit_pass so far
[21:37:29] (PS1) Dzahn: remove gerrit db_pass, it does not use mysql anymore [labs/private] - https://gerrit.wikimedia.org/r/609241
[21:38:16] (PS2) Dzahn: remove gerrit db_pass, it does not use mysql anymore [labs/private] - https://gerrit.wikimedia.org/r/609241 (https://phabricator.wikimedia.org/T255715)
[21:39:15] (CR) Dzahn: "also removed gerrit db_pass in 3 places: private hieradata, private passwords module and labs/private fake password" [puppet] - https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) (owner: Dzahn)
[21:40:00] (CR) Dzahn: [V: +2 C: +2] remove gerrit db_pass, it does not use mysql anymore [labs/private] - https://gerrit.wikimedia.org/r/609241 (https://phabricator.wikimedia.org/T255715) (owner: Dzahn)
[21:41:51] (Restored) Paladox: Gerrit: Convert log4j.xml to log4j2.xml [puppet] - https://gerrit.wikimedia.org/r/509542 (owner: Paladox)
[21:41:59] (CR) Paladox: [C: -1] "Not ready" [puppet] - https://gerrit.wikimedia.org/r/509542 (owner: Paladox)
[21:42:02] RECOVERY - snapshot of s3 in codfw on db2093 is OK: Last snapshot for s3 at codfw (db2098.codfw.wmnet:3313) taken on 2020-07-02 19:35:57 (912 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[21:42:09] (Abandoned) Paladox: Gerrit: Remove additivity from log4j.xml file [puppet] - https://gerrit.wikimedia.org/r/508657 (owner: Paladox)
[21:42:53] (CR) Dzahn: "Could use a little info when and why we want or need to upgrade it to version 2." [puppet] - https://gerrit.wikimedia.org/r/509542 (owner: Paladox)
[21:45:59] (CR) Paladox: [C: -1] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/509542 (owner: Paladox)
[21:46:02] (CR) Dzahn: "ran across this by chance because new gerrit UI shows me what i am CCed on and still open. is this really blocked since 2019? Could it be" [puppet] - https://gerrit.wikimedia.org/r/520433 (owner: Aklapper)
[21:46:11] that should have been removed a while ago :/
[21:46:40] (CR) Dzahn: "still requested?" [puppet] - https://gerrit.wikimedia.org/r/520433 (owner: Aklapper)
[21:47:52] (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/556265 would have to happen before this, is that the right order?" [puppet] - https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: Paladox)
[21:52:25] !log frwikibooks reindex successful, continuing on with remainder of french wikis
[21:52:27] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (ssingh)
[21:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:01] Operations, LDAP-Access-Requests, SRE-Access-Requests, Patch-For-Review, Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (ssingh)
[21:53:31] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (ssingh)
[21:53:35] (PS1) Mholloway: Wikifeeds: Update to 2020-07-02-214619-production [deployment-charts] - https://gerrit.wikimedia.org/r/609246 (https://phabricator.wikimedia.org/T255198)
[21:54:14] jouncebot: next
[21:54:14] In 9 hour(s) and 5 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200703T0700)
[21:54:18] jouncebot: now
[21:54:18] For the next 9 hour(s) and 5 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200702T0700)
[21:55:27] (CR) Mholloway: [C: -1] "For deployment next Monday, July 6." [deployment-charts] - https://gerrit.wikimedia.org/r/609246 (https://phabricator.wikimedia.org/T255198) (owner: Mholloway)
[21:56:11] !log gerrit1001 (prod gerrit) - restarting gerrit service
[21:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:52] gerrit was restarted and is back after some final cleanup changes
[21:56:58] we stop touching it now
[21:57:05] lgtm :)
[21:57:25] nice, so _definitely_ not using mysql anymore and fixed a couple UI things
[21:57:36] and a lot of code cleanup
[21:57:58] :-)
[21:58:37] nice!
[22:00:25] what were the ui things, out of curiosity?
[22:03:02] gerrit urls in phab comments. commitlinks in footer. eh...
[22:03:04] paladox: ^
[22:03:25] yeh
[22:03:30] ITS template name of merged commits, format short urls
[22:03:38] !log jhuneidi@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[22:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:19] cool, I'll keep my eyes out for that
[22:10:12] RECOVERY - snapshot of x1 in codfw on db2093 is OK: Last snapshot for x1 at codfw (db2101.codfw.wmnet:3320) taken on 2020-07-02 20:59:11 (215 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[22:11:35] thanks apergos
[22:17:34] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[22:24:57] Operations, Core Platform Team, Release Pipeline, Release-Engineering-Team-TODO, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (jeena) > If the kask cassandra subchart already works, do we really need another cassandr...
[22:25:45] Operations, Core Platform Team, Release Pipeline, Release-Engineering-Team-TODO, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (jeena)
[22:28:31] (PS2) Dzahn: site: add new POP install servers with insetup role [puppet] - https://gerrit.wikimedia.org/r/601342 (https://phabricator.wikimedia.org/T252526)
[22:28:45] (CR) jerkins-bot: [V: -1] site: add new POP install servers with insetup role [puppet] - https://gerrit.wikimedia.org/r/601342 (https://phabricator.wikimedia.org/T252526) (owner: Dzahn)
[23:00:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:06:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:08:05] Operations, ops-eqiad, Discovery-Search (Current work), Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (RKemper) (Meant to comment earlier, but I'm looking into the cause of the RAID failure)
[23:10:30] (Abandoned) Aklapper: Phab: Allow Greg and Andre to roll back specific user actions [puppet] - https://gerrit.wikimedia.org/r/520433 (owner: Aklapper)
[23:10:44] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:16:55] !log jhuneidi@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[23:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:00] !log jhuneidi@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[23:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:29] (CR) Dzahn: "Bryan mentioned we might move toolserver_legacy to tools. That would get the list down to 2 still using it. Or we can just get it merged a" [puppet] - https://gerrit.wikimedia.org/r/602722 (owner: Paladox)
[23:26:29] (CR) Dzahn: "another user is profile::mail::smarthost which is also wmcs. the one thing left for production seems to be tlsproxy::localssl for traffic" [puppet] - https://gerrit.wikimedia.org/r/602722 (owner: Paladox)
[23:42:46] RECOVERY - snapshot of s3 in eqiad on db2093 is OK: Last snapshot for s3 at eqiad (db1095.eqiad.wmnet:3313) taken on 2020-07-02 19:40:15 (919 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting