[00:18:35] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1112.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:31:17] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[03:28:21] PROBLEM - snapshot of s5 in eqiad on db1115 is CRITICAL: snapshot for s5 at eqiad taken more than 4 days ago: Most recent backup 2020-01-17 02:54:31 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[05:22:52] (PS1) KartikMistry: Move CX out of beta for af, is, lv and ne WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/566123 (https://phabricator.wikimedia.org/T242011)
[05:24:01] (CR) jerkins-bot: [V: -1] Move CX out of beta for af, is, lv and ne WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/566123 (https://phabricator.wikimedia.org/T242011) (owner: KartikMistry)
[05:35:21] PROBLEM - snapshot of s7 in eqiad on db1115 is CRITICAL: snapshot for s7 at eqiad taken more than 4 days ago: Most recent backup 2020-01-17 05:13:20 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[05:45:55] Operations, Performance-Team, serviceops: Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (Marostegui) >>! In T243149#5818385, @aaron wrote: > As long as there are any health checks that hit MediaWiki in codfw that involve DB...
[05:50:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2084:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10230 and previous config saved to /var/cache/conftool/dbconfig/20200121-055023-marostegui.json
[05:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:30] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453
[05:51:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2091:3314 T239453', diff saved to https://phabricator.wikimedia.org/P10231 and previous config saved to /var/cache/conftool/dbconfig/20200121-055149-marostegui.json
[05:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:18] !log Remove partitions from db2091:3314 - T239453
[05:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:24] (PS1) Marostegui: mariadb: Move es2022 from spare to es4 slave [puppet] - https://gerrit.wikimedia.org/r/566129 (https://phabricator.wikimedia.org/T243052)
[05:58:29] !log Stop MySQL on es2021 to clone es2022 - T243052
[05:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:32] (CR) Marostegui: [C: +2] mariadb: Move es2022 from spare to es4 slave [puppet] - https://gerrit.wikimedia.org/r/566129 (https://phabricator.wikimedia.org/T243052) (owner: Marostegui)
[05:58:33] T243052: Productionize es1020-es1025, es2020-es2025 - https://phabricator.wikimedia.org/T243052
[06:05:23] !log Stop replication on db1107
[06:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:06] (CR) KartikMistry: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/566123 (https://phabricator.wikimedia.org/T242011) (owner: KartikMistry)
[06:15:48] Operations, ops-codfw, DBA: db2085 crashed - memory issues - https://phabricator.wikimedia.org/T243148 (Marostegui) `enwiki` data check finished without
any issues. `wikidatawiki` check still on-going
[06:17:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 for upgrade', diff saved to https://phabricator.wikimedia.org/P10232 and previous config saved to /var/cache/conftool/dbconfig/20200121-061756-marostegui.json
[06:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:04] !log Upgrade db1087
[06:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:57] !log Aborted upgrade on db1087 (wiki dumps are running)
[06:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1087', diff saved to https://phabricator.wikimedia.org/P10233 and previous config saved to /var/cache/conftool/dbconfig/20200121-061932-marostegui.json
[06:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:41] (PS1) Marostegui: install_server: Do not reimage es2022 [puppet] - https://gerrit.wikimedia.org/r/566131 (https://phabricator.wikimedia.org/T243052)
[06:23:00] (CR) Marostegui: [C: +2] install_server: Do not reimage es2022 [puppet] - https://gerrit.wikimedia.org/r/566131 (https://phabricator.wikimedia.org/T243052) (owner: Marostegui)
[06:24:29] (PS2) Marostegui: mariadb: remove grants for users on phab1003 [puppet] - https://gerrit.wikimedia.org/r/552607 (https://phabricator.wikimedia.org/T238957) (owner: Dzahn)
[06:28:24] !log Remove the following users from phabricator database: 'phadmin'@'10.64.48.21' 'phuser'@'10.64.48.21' 'phstats'@'10.64.48.21' 'phmanifest'@'10.64.48.21' T238957
[06:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:27] T238957: decommission phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T238957
[06:28:30] (CR) Marostegui: [C: +2] mariadb: remove grants for users on phab1003 [puppet] - https://gerrit.wikimedia.org/r/552607
(https://phabricator.wikimedia.org/T238957) (owner: Dzahn)
[06:28:56] (CR) Marostegui: [C: +2] "Users removed from the DB" [puppet] - https://gerrit.wikimedia.org/r/552607 (https://phabricator.wikimedia.org/T238957) (owner: Dzahn)
[06:37:38] (CR) Giuseppe Lavagetto: [C: -1] "Please use the helpers for TLS setup" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) (owner: Ottomata)
[06:39:34] (CR) Giuseppe Lavagetto: [C: -1] "Please add TLS termination, see cxserver or termbox as examples." [deployment-charts] - https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: MSantos)
[06:39:50] (PS1) Marostegui: tables_to_check.txt: Add 3 more tables to check [software] - https://gerrit.wikimedia.org/r/566132
[06:40:45] (CR) Marostegui: [C: +2] tables_to_check.txt: Add 3 more tables to check [software] - https://gerrit.wikimedia.org/r/566132 (owner: Marostegui)
[06:41:42] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs.
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:43:22] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:00:09] (PS2) Giuseppe Lavagetto: citoid: add TLS termination [deployment-charts] - https://gerrit.wikimedia.org/r/558091 (https://phabricator.wikimedia.org/T235411)
[07:01:35] (CR) Giuseppe Lavagetto: [C: +2] citoid: add TLS termination [deployment-charts] - https://gerrit.wikimedia.org/r/558091 (https://phabricator.wikimedia.org/T235411) (owner: Giuseppe Lavagetto)
[07:01:36] (Merged) jenkins-bot: citoid: add TLS termination [deployment-charts] - https://gerrit.wikimedia.org/r/558091 (https://phabricator.wikimedia.org/T235411) (owner: Giuseppe Lavagetto)
[07:10:54] (PS1) Giuseppe Lavagetto: Fix citoid chart tar [deployment-charts] - https://gerrit.wikimedia.org/r/566158
[07:10:56] (PS1) Giuseppe Lavagetto: Enable TLS on citoid [deployment-charts] - https://gerrit.wikimedia.org/r/566159
[07:11:09] (CR) Giuseppe Lavagetto: [C: +2] Fix citoid chart tar [deployment-charts] - https://gerrit.wikimedia.org/r/566158 (owner: Giuseppe Lavagetto)
[07:11:25] (Merged) jenkins-bot: Fix citoid chart tar [deployment-charts] - https://gerrit.wikimedia.org/r/566158 (owner: Giuseppe Lavagetto)
[07:12:09] (CR) Giuseppe Lavagetto: [C: +2] Enable TLS on citoid [deployment-charts] - https://gerrit.wikimedia.org/r/566159 (owner: Giuseppe Lavagetto)
[07:13:35] (CR) Giuseppe Lavagetto: [C: +2] Enable TLS on citoid [deployment-charts] - https://gerrit.wikimedia.org/r/566159 (owner: Giuseppe Lavagetto)
[07:13:50] (Merged) jenkins-bot: Enable TLS on citoid [deployment-charts] - https://gerrit.wikimedia.org/r/566159 (owner: Giuseppe Lavagetto)
[07:20:25] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' .
[07:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:26] <_joe_> !log adding TLS to citoid in production
[07:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:01] !log oblivian@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'citoid' for release 'production' .
[07:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:26] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'citoid' for release 'production' .
[07:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:21] (PS3) Elukey: profile::analytics::cluster::packages::common: add libcrypto.so link [puppet] - https://gerrit.wikimedia.org/r/566062 (https://phabricator.wikimedia.org/T240934)
[07:47:23] (PS1) Elukey: Set Spark2 encryption options as default for Hadoop [puppet] - https://gerrit.wikimedia.org/r/566231 (https://phabricator.wikimedia.org/T240934)
[08:01:44] Operations, ops-codfw, DBA: db2085 crashed - memory issues - https://phabricator.wikimedia.org/T243148 (Marostegui) `wikidatawiki` data check finished without any drifts.
[08:12:50] (CR) Muehlenhoff: [C: +1] "Suggestion for the comment/docs inline:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/566062 (https://phabricator.wikimedia.org/T240934) (owner: Elukey)
[08:19:17] (PS3) Muehlenhoff: Switch DNS servers and contemporary LVSes to standard Partman recipes [puppet] - https://gerrit.wikimedia.org/r/561837 (https://phabricator.wikimedia.org/T156955)
[08:24:25] (CR) Muehlenhoff: [C: +2] Switch DNS servers and contemporary LVSes to standard Partman recipes [puppet] - https://gerrit.wikimedia.org/r/561837 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[08:25:09] (PS2) Muehlenhoff: dnsrecursor: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/543803
[08:27:01] (CR) Muehlenhoff: [C: +2] dnsrecursor: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/543803 (owner: Muehlenhoff)
[08:38:16] (PS1) Muehlenhoff: Switch some analytics roles to standard Partman recipes [puppet] - https://gerrit.wikimedia.org/r/566237 (https://phabricator.wikimedia.org/T156955)
[08:40:08] (CR) DCausse: [C: +1] [cirrus] A/B test for MLR models [mediawiki-config] - https://gerrit.wikimedia.org/r/559614 (https://phabricator.wikimedia.org/T219534) (owner: Mstyles)
[08:52:45] (PS4) Elukey: profile::analytics::cluster::packages::common: add libcrypto.so link [puppet] - https://gerrit.wikimedia.org/r/566062 (https://phabricator.wikimedia.org/T240934)
[08:52:47] (PS2) Elukey: Set Spark2 encryption options as default for Hadoop [puppet] - https://gerrit.wikimedia.org/r/566231 (https://phabricator.wikimedia.org/T240934)
[08:53:33] ACKNOWLEDGEMENT - MariaDB Slave Lag: s8 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 435.87 seconds Marostegui long query https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:56:44] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] -
https://gerrit.wikimedia.org/r/566062 (https://phabricator.wikimedia.org/T240934) (owner: Elukey)
[09:09:26] (PS1) Elukey: admin: add kerberos flag for user dsaez [puppet] - https://gerrit.wikimedia.org/r/566239
[09:11:48] (CR) Elukey: [C: +2] "Relevant info:" [puppet] - https://gerrit.wikimedia.org/r/566239 (owner: Elukey)
[09:12:08] PROBLEM - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 4 days ago: Most recent backup 2020-01-17 08:53:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[09:37:32] (PS2) Arturo Borrero Gonzalez: [RFC] kubernetes: add support for multiple objects of any kind [software/tools-webservice] - https://gerrit.wikimedia.org/r/566045 (https://phabricator.wikimedia.org/T156626)
[09:50:56] (PS1) Muehlenhoff: Remove references to obsolete aliases in Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/566242
[09:51:15] (PS2) Giuseppe Lavagetto: mathoid: add TLS termination [deployment-charts] - https://gerrit.wikimedia.org/r/558092 (https://phabricator.wikimedia.org/T235411)
[09:53:02] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/566242 (owner: Muehlenhoff)
[09:53:07] (CR) Muehlenhoff: [C: +2] Remove references to obsolete aliases in Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/566242 (owner: Muehlenhoff)
[10:03:32] (CR) Ema: [C: +1] "I buy it" [puppet] - https://gerrit.wikimedia.org/r/564708 (https://phabricator.wikimedia.org/T242620) (owner: Vgutierrez)
[10:05:52] !log volans@cumin2001 START - Cookbook sre.hosts.downtime
[10:05:53] !log volans@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:41] (PS2) Ema: netconsole:: rename to netconsole::client [puppet] - https://gerrit.wikimedia.org/r/566063
(https://phabricator.wikimedia.org/T242579)
[10:11:43] (PS2) Ema: netconsole: add netconsole::server [puppet] - https://gerrit.wikimedia.org/r/566064 (https://phabricator.wikimedia.org/T242579)
[10:14:02] (CR) Vgutierrez: [C: +2] ATS: Set connect timeout and TTFB timeouts to different values [puppet] - https://gerrit.wikimedia.org/r/564708 (https://phabricator.wikimedia.org/T242620) (owner: Vgutierrez)
[10:14:55] Operations, DNS, Domains, Traffic: Donate wikiźródła.pl and wikisłownik.pl to the Foundation - https://phabricator.wikimedia.org/T240446 (tomasz) Open→Stalled
[10:15:57] (PS3) Arturo Borrero Gonzalez: [RFC] kubernetes: add support for multiple objects of any kind [software/tools-webservice] - https://gerrit.wikimedia.org/r/566045 (https://phabricator.wikimedia.org/T156626)
[10:15:59] (PS5) Arturo Borrero Gonzalez: kubernetes: add support for domain-based routing in the new kubernetes cluster [software/tools-webservice] - https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617)
[10:16:45] (CR) jerkins-bot: [V: -1] kubernetes: add support for domain-based routing in the new kubernetes cluster [software/tools-webservice] - https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[10:18:49] (PS4) Mvolz: Update citoid to 30f793422 (staging cluster only) [deployment-charts] - https://gerrit.wikimedia.org/r/565261
[10:18:53] (PS2) Mvolz: Update zotero to 5953b26 (staging only) [deployment-charts] - https://gerrit.wikimedia.org/r/565262
[10:23:32] Operations, Citoid, Gerrit-Privilege-Requests, SRE-Access-Requests: Requesting +2 rights for Mvolz for operations/deployment-charts - https://phabricator.wikimedia.org/T243070 (Mvolz) Resolved→Open
[10:23:35] Operations, Citoid, SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269
(Mvolz)
[10:24:03] Operations, Citoid, Gerrit-Privilege-Requests, SRE-Access-Requests: Requesting +2 rights for Mvolz for operations/deployment-charts - https://phabricator.wikimedia.org/T243070 (Mvolz) Unfortunately I'm still not seeing the ability to +2 :/
[10:25:30] (CR) Filippo Giunchedi: [C: +2] thumbor: ship logs to localhost / logging pipeline [puppet] - https://gerrit.wikimedia.org/r/566069 (https://phabricator.wikimedia.org/T242609) (owner: Filippo Giunchedi)
[10:25:39] (PS2) Filippo Giunchedi: thumbor: ship logs to localhost / logging pipeline [puppet] - https://gerrit.wikimedia.org/r/566069 (https://phabricator.wikimedia.org/T242609)
[10:28:45] (CR) Arturo Borrero Gonzalez: "> Patch Set 4:" [software/tools-webservice] - https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[10:33:48] (CR) Mvolz: [C: +2] Update citoid to 30f793422 (staging cluster only) [deployment-charts] - https://gerrit.wikimedia.org/r/565261 (owner: Mvolz)
[10:34:00] (CR) Mvolz: [C: +2] Update zotero to 5953b26 (staging only) [deployment-charts] - https://gerrit.wikimedia.org/r/565262 (owner: Mvolz)
[10:34:03] (Merged) jenkins-bot: Update citoid to 30f793422 (staging cluster only) [deployment-charts] - https://gerrit.wikimedia.org/r/565261 (owner: Mvolz)
[10:34:17] (PS3) Alexandros Kosiaris: Update zotero to 5953b26 (staging only) [deployment-charts] - https://gerrit.wikimedia.org/r/565262 (owner: Mvolz)
[10:34:20] (CR) Alexandros Kosiaris: [V: +2] Update zotero to 5953b26 (staging only) [deployment-charts] - https://gerrit.wikimedia.org/r/565262 (owner: Mvolz)
[10:36:32] !log roll-restart thumbor after https://gerrit.wikimedia.org/r/c/operations/puppet/+/566069
[10:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:45] (CR) Vgutierrez: [C: +1] netconsole:: rename to netconsole::client [puppet] -
https://gerrit.wikimedia.org/r/566063 (https://phabricator.wikimedia.org/T242579) (owner: Ema)
[10:38:05] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'zotero' for release 'staging' .
[10:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:26] !log mvolz@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'zotero' for release 'staging' .
[10:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:04] (CR) Ema: [C: +2] netconsole:: rename to netconsole::client [puppet] - https://gerrit.wikimedia.org/r/566063 (https://phabricator.wikimedia.org/T242579) (owner: Ema)
[10:47:18] (CR) Vgutierrez: [C: +1] Reset waitIndex on etcd error 401 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/557024 (https://phabricator.wikimedia.org/T169765) (owner: Giuseppe Lavagetto)
[10:47:48] _joe_: ^^ I can handle the release and the upgrade process
[10:47:54] !log mvolz@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' .
[10:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:27] <_joe_> vgutierrez: oh thanks :))
[10:49:08] oh don't thank me, it's just yet another attempt of winning my t-shirt
[10:49:39] * ema has high hopes
[10:55:29] (CR) Vgutierrez: [C: +2] Reset waitIndex on etcd error 401 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/557024 (https://phabricator.wikimedia.org/T169765) (owner: Giuseppe Lavagetto)
[11:00:22] Operations, Citoid, Gerrit-Privilege-Requests, SRE-Access-Requests: Requesting +2 rights for Mvolz for operations/deployment-charts - https://phabricator.wikimedia.org/T243070 (akosiaris) Open→Resolved We resolved this live in a hangout with @Mvolz.
Re-resolving
[11:00:25] Operations, Citoid, SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (akosiaris)
[11:01:57] (PS2) Filippo Giunchedi: cacheproxy: default tx ring [puppet] - https://gerrit.wikimedia.org/r/563976
[11:01:59] (PS2) Filippo Giunchedi: WIP: puppetvagrant [puppet] - https://gerrit.wikimedia.org/r/561858 (owner: Ema)
[11:02:01] (PS2) Filippo Giunchedi: WIP puppetvagrant role [puppet] - https://gerrit.wikimedia.org/r/562527
[11:02:03] (PS1) Filippo Giunchedi: WIP role::cache::text testing [puppet] - https://gerrit.wikimedia.org/r/566246
[11:02:39] (PS1) Mvolz: Update citoid to 30f793422 (codfw & eqiad) [deployment-charts] - https://gerrit.wikimedia.org/r/566247
[11:02:58] (PS2) Mvolz: Update citoid to 30f793422 (codfw & eqiad) [deployment-charts] - https://gerrit.wikimedia.org/r/566247
[11:04:43] (CR) jerkins-bot: [V: -1] WIP puppetvagrant role [puppet] - https://gerrit.wikimedia.org/r/562527 (owner: Filippo Giunchedi)
[11:05:09] (PS1) Vgutierrez: Release 1.15.7 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/566248 (https://phabricator.wikimedia.org/T169765)
[11:09:26] (CR) Mvolz: [C: +2] Update citoid to 30f793422 (codfw & eqiad) [deployment-charts] - https://gerrit.wikimedia.org/r/566247 (owner: Mvolz)
[11:09:42] (Merged) jenkins-bot: Update citoid to 30f793422 (codfw & eqiad) [deployment-charts] - https://gerrit.wikimedia.org/r/566247 (owner: Mvolz)
[11:10:19] (PS2) Vgutierrez: Release 1.15.7 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/566248 (https://phabricator.wikimedia.org/T169765)
[11:13:24] (CR) Ema: [C: +1] Release 1.15.7 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/566248 (https://phabricator.wikimedia.org/T169765) (owner: Vgutierrez)
[11:15:14] (CR) Vgutierrez: [C: +2] Release 1.15.7 [debs/pybal] (1.15-stretch) -
https://gerrit.wikimedia.org/r/566248 (https://phabricator.wikimedia.org/T169765) (owner: Vgutierrez)
[11:19:01] (PS1) Giuseppe Lavagetto: service::catalog: fix host header for check_http* monitoring [puppet] - https://gerrit.wikimedia.org/r/566250
[11:22:02] !log mvolz@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'citoid' for release 'production' .
[11:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:12] !log uploaded pybal 1.15.7 to apt.w.o (stretch) - T169765
[11:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:14] T169765: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765
[11:24:00] !log Updating pybal to 1.15.7 on ulsfo load balancers - T169765
[11:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:31] (PS6) Arturo Borrero Gonzalez: kubernetes: add support for domain-based routing in the new kubernetes cluster [software/tools-webservice] - https://gerrit.wikimedia.org/r/565575 (https://phabricator.wikimedia.org/T234617)
[11:31:30] (PS1) Elukey: profile::kerberos: check the principal before creating it [puppet] - https://gerrit.wikimedia.org/r/566251
[11:36:13] (PS2) Elukey: profile::kerberos: check the principal before creating it [puppet] - https://gerrit.wikimedia.org/r/566251
[11:38:25] (CR) jerkins-bot: [V: -1] profile::kerberos: check the principal before creating it [puppet] - https://gerrit.wikimedia.org/r/566251 (owner: Elukey)
[11:39:26] Operations, Wikimedia-Logstash, observability, User-fgiunchedi: Move thumbor to the logging pipeline - https://phabricator.wikimedia.org/T242609 (fgiunchedi) Open→Resolved a:fgiunchedi Thumbor now is logging to `localhost:11514`, from there logs are shipped to the logging pipeline, re...
[11:39:30] Operations, Wikimedia-Logstash, observability, User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (fgiunchedi)
[11:39:54] !log mvolz@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'citoid' for release 'production' .
[11:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:22] Operations, Traffic, Inuka-Team (Kanban), Patch-For-Review, Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (SBisson) A very special thanks to @phuedx for reviewing and merging the patch!! This task is now for @nshahquinn-wmf to te...
[11:40:53] Operations, Wikimedia-Logstash, observability, Patch-For-Review, User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (fgiunchedi)
[11:41:08] RECOVERY - snapshot of s5 in eqiad on db1115 is OK: snapshot for s5 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2020-01-21 10:51:30 from db1102.eqiad.wmnet:3315 (658 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[11:44:39] !log importing puppet-master packages to component/puppet5
[11:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:58] RECOVERY - snapshot of s7 in eqiad on db1115 is OK: snapshot for s7 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2020-01-21 09:49:45 from db1116.eqiad.wmnet:3317 (905 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[11:52:06] <_joe_> vgutierrez: let's go!
[11:52:12] :D
[11:52:44] <_joe_> !log restarting etcd on conf2003 to test new pybal reconnection.
Issues expected for pybal in eqsin, but not in ulsfo
[11:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:52] yep
[11:53:55] ulsfo looks good
[11:53:57] <_joe_> vgutierrez: confirmed :)
[11:54:04] !log restarting pybal instancs on eqsin
[11:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:13] <_joe_> ulsfo looks good, ok this is a huge plus for operations on etcd
[11:55:41] :D
[11:56:12] !log upgrading pybal on eqsin and codfw - T169765
[11:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:15] T169765: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765
[11:56:29] (PS3) Elukey: profile::kerberos: check the principal before creating it [puppet] - https://gerrit.wikimedia.org/r/566251
[12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200121T1200).
[12:00:04] Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:11] o/
[12:00:22] (PS3) Ladsgroup: Set useEntitySourceBasedFederation to true for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/562578 (https://phabricator.wikimedia.org/T241972)
[12:01:23] (CR) Ladsgroup: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/562578 (https://phabricator.wikimedia.org/T241972) (owner: Ladsgroup)
[12:01:25] (CR) Alexandros Kosiaris: [C: -1] "A few more comments inline" (27 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/554576 (owner: Holger Knust)
[12:01:27] (PS1) Effie Mouzeli: role::prometheus::beta: add mcrouter metrics [puppet] - https://gerrit.wikimedia.org/r/566255
[12:01:51] (Merged) jenkins-bot: Set useEntitySourceBasedFederation to true for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/562578 (https://phabricator.wikimedia.org/T241972) (owner: Ladsgroup)
[12:03:30] o/
[12:03:44] uh
[12:03:46] wikitech down?
[12:03:55] well, it worked after a reload
[12:04:39] (PS1) Mvolz: Update zotero to 5953b26 (codfw & eqiad) [deployment-charts] - https://gerrit.wikimedia.org/r/566256
[12:04:45] (CR) Alexandros Kosiaris: [C: -1] "Given that pybal's backend unit is indeed the node and we anyway end up doing 2 load balancings (1 to nodes, and 1 to pods), this is proba" [puppet] - https://gerrit.wikimedia.org/r/566075 (owner: Giuseppe Lavagetto)
[12:07:23] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:562578|Set useEntitySourceBasedFederation to true for Wikidata (T241972)]] (duration: 01m 12s)
[12:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:26] T241972: wmgUseEntitySourceBasedFederation true for Wikidata.org - https://phabricator.wikimedia.org/T241972
[12:09:31] (CR) Mvolz: [C: +2] Update zotero to 5953b26 (codfw & eqiad) [deployment-charts] - https://gerrit.wikimedia.org/r/566256 (owner: Mvolz)
[12:09:48] (Merged)
jenkins-bot: Update zotero to 5953b26 (codfw & eqiad) [deployment-charts] - https://gerrit.wikimedia.org/r/566256 (owner: Mvolz)
[12:11:03] (CR) Joal: [C: +1] "looks good :)" [puppet] - https://gerrit.wikimedia.org/r/566231 (https://phabricator.wikimedia.org/T240934) (owner: Elukey)
[12:11:56] (CR) Alexandros Kosiaris: [C: +2] logspam.pl: Shorten paths and include fatals [puppet] - https://gerrit.wikimedia.org/r/559246 (https://phabricator.wikimedia.org/T242252) (owner: Brennen Bearnes)
[12:12:09] Doing it twice, the IS.php cache thingy
[12:12:53] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:562578|Set useEntitySourceBasedFederation to true for Wikidata (T241972)]] (duration: 00m 59s)
[12:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:56] T241972: wmgUseEntitySourceBasedFederation true for Wikidata.org - https://phabricator.wikimedia.org/T241972
[12:19:41] (PS1) Alexandros Kosiaris: Fix helm release NOTES.txt text [deployment-charts] - https://gerrit.wikimedia.org/r/566260
[12:19:56] !log upgrading pybal on esams and eqiad - T169765
[12:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:00] T169765: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765
[12:21:09] !log mvolz@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'zotero' for release 'production' .
[12:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:00] (PS2) Alexandros Kosiaris: Fix helm release NOTES.txt text [deployment-charts] - https://gerrit.wikimedia.org/r/566260
[12:25:03] (PS1) Alexandros Kosiaris: scaffold: Remove the 1.8 kubernetes netpol if clause [deployment-charts] - https://gerrit.wikimedia.org/r/566262
[12:25:19] reverting
[12:26:06] (PS1) Ladsgroup: Revert "Set useEntitySourceBasedFederation to true for Wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/566264
[12:26:24] (CR) Ladsgroup: [C: +2] Revert "Set useEntitySourceBasedFederation to true for Wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/566264 (owner: Ladsgroup)
[12:27:19] (Merged) jenkins-bot: Revert "Set useEntitySourceBasedFederation to true for Wikidata" [mediawiki-config] - https://gerrit.wikimedia.org/r/566264 (owner: Ladsgroup)
[12:27:48] (CR) Mvolz: [C: +1] Fix helm release NOTES.txt text [deployment-charts] - https://gerrit.wikimedia.org/r/566260 (owner: Alexandros Kosiaris)
[12:28:49] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert [[gerrit:562578|Set useEntitySourceBasedFederation to true for Wikidata (T241972)]] (duration: 01m 00s)
[12:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:52] T241972: wmgUseEntitySourceBasedFederation true for Wikidata.org - https://phabricator.wikimedia.org/T241972
[12:29:19] (CR) Mvolz: Fix helm release NOTES.txt text (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/566260 (owner: Alexandros Kosiaris)
[12:29:51] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert [[gerrit:562578|Set useEntitySourceBasedFederation to true for Wikidata (T241972)]] (duration: 00m 58s)
[12:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:03] (CR) Alexandros Kosiaris: [C:
+2] Fix helm release NOTES.txt text [deployment-charts] - https://gerrit.wikimedia.org/r/566260 (owner: Alexandros Kosiaris)
[12:30:14] (CR) Alexandros Kosiaris: [C: +2] scaffold: Remove the 1.8 kubernetes netpol if clause [deployment-charts] - https://gerrit.wikimedia.org/r/566262 (owner: Alexandros Kosiaris)
[12:30:23] (PS1) Jbond: fastnetmon: correct threshold_bandwidth name. [puppet] - https://gerrit.wikimedia.org/r/566274
[12:30:27] (Merged) jenkins-bot: Fix helm release NOTES.txt text [deployment-charts] - https://gerrit.wikimedia.org/r/566260 (owner: Alexandros Kosiaris)
[12:30:33] (Merged) jenkins-bot: scaffold: Remove the 1.8 kubernetes netpol if clause [deployment-charts] - https://gerrit.wikimedia.org/r/566262 (owner: Alexandros Kosiaris)
[12:31:09] (CR) Alexandros Kosiaris: codesearch: Use iptables from buster-backports for docker compatibility (1 comment) [puppet] - https://gerrit.wikimedia.org/r/565752 (https://phabricator.wikimedia.org/T242319) (owner: Legoktm)
[12:51:11] Operations, ops-eqiad, serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (jijiki) @Cmjohnson @Jclark-ctr please let us know when we can have those servers online
[12:53:41] (CR) Ema: Release 2.0.91-2wm (1 comment) [software/varnish/libvmod-tbf] (debian) - https://gerrit.wikimedia.org/r/565550 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[12:54:14] (CR) Ema: [C: +1] 1.3.1-3 Rebuild for buster [software/varnish/libvmod-re2] (debian) - https://gerrit.wikimedia.org/r/562801 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[12:54:39] (CR) Ema: [C: +1] 1.7-3: Rebuild for buster [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/562515 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[13:00:31] (PS2) CDanis: fastnetmon: correct threshold_bandwidth name.
[puppet] - https://gerrit.wikimedia.org/r/566274 (owner: Jbond)
[13:00:34] !log mvolz@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'zotero' for release 'production' .
[13:00:34] (CR) CDanis: [C: +1] fastnetmon: correct threshold_bandwidth name. [puppet] - https://gerrit.wikimedia.org/r/566274 (owner: Jbond)
[13:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:58] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[13:24:34] (CR) Giuseppe Lavagetto: "> That being said, kubeproxy is not really a service, just a component of the cluster. I 'd rather we set something else as a value. "kube" [puppet] - https://gerrit.wikimedia.org/r/566075 (owner: Giuseppe Lavagetto)
[13:24:38] !log Clean up some gerrit grants on db1132 (m2 master) T233714
[13:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:12] PROBLEM - WDQS high update lag on wdqs1010 is CRITICAL: 3657 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[13:27:28] looking but wdqs1010 is a test machine
[13:27:30] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[13:29:37] Operations, Gerrit: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (Marostegui)
[13:33:02] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[13:35:11] (PS2) Giuseppe Lavagetto: conftool: add kubeproxy service [puppet] - https://gerrit.wikimedia.org/r/566075
[13:35:13] (PS2) Giuseppe Lavagetto: lvs: switch all services on k8s to use the same conftool service [puppet] - https://gerrit.wikimedia.org/r/566076
[13:35:15] (PS2) Giuseppe Lavagetto: conftool: remove unused services from kubernetes [puppet] - https://gerrit.wikimedia.org/r/566077
[13:38:04] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:38:34] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[13:38:34] (PS3) Giuseppe Lavagetto: conftool: add kubesvc service [puppet] - https://gerrit.wikimedia.org/r/566075
[13:38:36] (PS3) Giuseppe Lavagetto: lvs: switch all services on k8s to use the same conftool service [puppet] - https://gerrit.wikimedia.org/r/566076
[13:38:38] (PS3) Giuseppe Lavagetto: conftool: remove unused services from kubernetes [puppet] - https://gerrit.wikimedia.org/r/566077
[13:40:58] (PS2) Muehlenhoff: Switch some of the WMCS systems to standardized Partman recipes [puppet] - https://gerrit.wikimedia.org/r/566035 (https://phabricator.wikimedia.org/T156955)
[13:41:40] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:44:02] (CR) Muehlenhoff: [C: +2] Switch some of the WMCS systems to standardized Partman recipes [puppet] - https://gerrit.wikimedia.org/r/566035 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[13:47:23] (CR) Giuseppe Lavagetto: [C: +2] conftool: add kubesvc service [puppet] - https://gerrit.wikimedia.org/r/566075 (owner: Giuseppe Lavagetto)
[13:48:53] (CR) Filippo Giunchedi: [C: +1] role::prometheus::beta: add mcrouter metrics [puppet] - https://gerrit.wikimedia.org/r/566255 (owner: Effie Mouzeli)
[13:48:58] (PS3) Ema: netconsole: add netconsole::server [puppet] - https://gerrit.wikimedia.org/r/566064 (https://phabricator.wikimedia.org/T242579)
[13:49:07] (PS1) Arturo Borrero Gonzalez: mirrors: mirror Debian openstack backports repositories [puppet] - https://gerrit.wikimedia.org/r/566284 (https://phabricator.wikimedia.org/T238820)
[13:51:28] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[13:51:39] (PS2) Arturo Borrero Gonzalez: mirrors: mirror Debian openstack backports repositories [puppet] - https://gerrit.wikimedia.org/r/566284 (https://phabricator.wikimedia.org/T238820)
[13:52:23] (CR) Ayounsi: [C: +1] "Indeed, thanks for the fix!" [puppet] - https://gerrit.wikimedia.org/r/566274 (owner: Jbond)
[13:53:37] (CR) Jbond: [C: +2] fastnetmon: correct threshold_bandwidth name. [puppet] - https://gerrit.wikimedia.org/r/566274 (owner: Jbond)
[13:54:02] (CR) Muehlenhoff: [C: +1] "Looks good!" [software/spicerack] - https://gerrit.wikimedia.org/r/566054 (https://phabricator.wikimedia.org/T231068) (owner: Volans)
[13:54:12] (PS3) Arturo Borrero Gonzalez: mirrors: mirror Debian openstack backports repositories [puppet] - https://gerrit.wikimedia.org/r/566284 (https://phabricator.wikimedia.org/T238820)
[13:54:14] (CR) Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/20478/" [puppet] - https://gerrit.wikimedia.org/r/566284 (https://phabricator.wikimedia.org/T238820) (owner: Arturo Borrero Gonzalez)
[13:56:42] (CR) Filippo Giunchedi: [C: +1] Switch some analytics roles to standard Partman recipes [puppet] - https://gerrit.wikimedia.org/r/566237 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[13:56:55] (CR) Muehlenhoff: [C: +1] "Looks good, one comment inline, but feel free to ignore."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/566251 (owner: Elukey)
[13:58:52] (CR) Elukey: profile::kerberos: check the principal before creating it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/566251 (owner: Elukey)
[14:02:46] (PS3) Filippo Giunchedi: install_server: introduce raid0 standard partman recipe [puppet] - https://gerrit.wikimedia.org/r/564959 (https://phabricator.wikimedia.org/T156955)
[14:02:48] (PS1) Filippo Giunchedi: install_server: switch deploy* to standard partman recipe [puppet] - https://gerrit.wikimedia.org/r/566286 (https://phabricator.wikimedia.org/T156955)
[14:02:57] !log oblivian@puppetmaster1001 conftool action : set/weight=10:pooled=yes; selector: service=kubesvc,cluster=kubernetes
[14:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:13] (CR) Muehlenhoff: [C: +1] profile::kerberos: check the principal before creating it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/566251 (owner: Elukey)
[14:04:19] (PS4) Ema: netconsole: add netconsole::server [puppet] - https://gerrit.wikimedia.org/r/566064 (https://phabricator.wikimedia.org/T242579)
[14:04:22] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:05:09] (PS4) Elukey: profile::kerberos: check the principal before creating it [puppet] - https://gerrit.wikimedia.org/r/566251
[14:07:08] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/566251 (owner: Elukey)
[14:08:00] (CR) Giuseppe Lavagetto: [C: +2] lvs: switch all services on k8s to use the same conftool service [puppet] - https://gerrit.wikimedia.org/r/566076 (owner: Giuseppe Lavagetto)
[14:08:06] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[14:09:28] (PS1) Ema: Add netconsole server to esams ganeti [puppet] - https://gerrit.wikimedia.org/r/566288 (https://phabricator.wikimedia.org/T242579)
[14:10:24] RECOVERY - WDQS high update lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1133 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[14:11:08] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:11:11] (PS1) Filippo Giunchedi: base: display disabled message when check_puppetrun is in grace period [puppet] - https://gerrit.wikimedia.org/r/566289
[14:11:26] (CR) Ema: "pcc here:
https://puppet-compiler.wmflabs.org/compiler1002/20479/" [puppet] - https://gerrit.wikimedia.org/r/566288 (https://phabricator.wikimedia.org/T242579) (owner: Ema)
[14:11:46] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:14:31] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/566286 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[14:17:18] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[14:19:28] (PS1) Filippo Giunchedi: install_server: move oresrdb and sessionstore to standard partman recipes [puppet] - https://gerrit.wikimedia.org/r/566290 (https://phabricator.wikimedia.org/T156955)
[14:20:41] (CR) Filippo Giunchedi: [C: +2] install_server: switch deploy* to standard partman recipe [puppet] - https://gerrit.wikimedia.org/r/566286 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[14:20:49] (PS2) Filippo Giunchedi: install_server: switch deploy* to standard partman recipe [puppet] - https://gerrit.wikimedia.org/r/566286 (https://phabricator.wikimedia.org/T156955)
[14:20:58] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:24:07] <_joe_> !log restarting pybal on lvs low-traffic in codfw
[14:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:08] (PS2) Filippo Giunchedi: install_server: move oresrdb and sessionstore to standard partman recipes [puppet] - https://gerrit.wikimedia.org/r/566290 (https://phabricator.wikimedia.org/T156955)
[14:25:21] Operations: Retire the Tor relay - https://phabricator.wikimedia.org/T243288 (MoritzMuehlenhoff)
[14:25:28] Operations: Retire the Tor relay - https://phabricator.wikimedia.org/T243288 (MoritzMuehlenhoff) p:Triage→Normal
[14:26:21] (CR) Ema: [C: +2] netconsole: add netconsole::server [puppet] - https://gerrit.wikimedia.org/r/566064 (https://phabricator.wikimedia.org/T242579) (owner: Ema)
[14:28:03] (PS1) Filippo Giunchedi: install_server: switch ms-fe to standard partman recipes [puppet] - https://gerrit.wikimedia.org/r/566291 (https://phabricator.wikimedia.org/T156955)
[14:28:36] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:30:36] !log cdanis@cumin2001 conftool action : set/weight=15; selector: cluster=api_appserver,dc=eqiad,service=nginx,name=mw12[23].*
[14:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:53] (PS1) Alexandros Kosiaris: Remove all 1.7 netpol checks [deployment-charts] - https://gerrit.wikimedia.org/r/566292
[14:33:25] !log cdanis@cumin2001 conftool action : set/weight=30; selector: cluster=api_appserver,dc=eqiad,service=nginx,name=mw13.*
[14:33:26] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:37] !log cdanis@cumin2001 conftool action : set/weight=30; selector: cluster=api_appserver,dc=eqiad,service=apache2,name=mw13.*
[14:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:30] Operations, Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep - https://phabricator.wikimedia.org/T243226 (MoritzMuehlenhoff) p:Triage→Normal a:jbond
[14:34:57] (PS2) Ema: netconsole: add server to esams ganeti [puppet] - https://gerrit.wikimedia.org/r/566288 (https://phabricator.wikimedia.org/T242579)
[14:35:58] <_joe_> !log restart pybal on low-traffic eqiad to pick up new configuration
[14:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:25] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[14:36:26] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:16] !log uploaded varnish-modules 0.12-1+wmf2 to apt.w.o (buster) - T242093
[14:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:19] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093
[14:37:55] (CR) Vgutierrez: [V: +2 C: +2] 1.3.1-3 Rebuild for buster [software/varnish/libvmod-re2] (debian) - https://gerrit.wikimedia.org/r/562801 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[14:38:11] (CR) Vgutierrez: [C: +2] 1.7-3: Rebuild for buster [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/562515 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[14:38:55] (CR) Alexandros Kosiaris: [C: +2] Remove all 1.7 netpol checks [deployment-charts] - https://gerrit.wikimedia.org/r/566292 (owner: Alexandros
Kosiaris)
[14:38:56] !log Rolling restart all eqiad mw api servers
[14:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:11] (Merged) jenkins-bot: Remove all 1.7 netpol checks [deployment-charts] - https://gerrit.wikimedia.org/r/566292 (owner: Alexandros Kosiaris)
[14:39:22] !log stopping/masking tor on torrelay1001 T243288
[14:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:25] T243288: Retire the Tor relay - https://phabricator.wikimedia.org/T243288
[14:39:45] (PS3) Ema: netconsole: add server to esams ganeti [puppet] - https://gerrit.wikimedia.org/r/566288 (https://phabricator.wikimedia.org/T242579)
[14:39:52] (CR) Elukey: [C: +2] profile::kerberos: check the principal before creating it [puppet] - https://gerrit.wikimedia.org/r/566251 (owner: Elukey)
[14:40:27] (CR) jerkins-bot: [V: -1] netconsole: add server to esams ganeti [puppet] - https://gerrit.wikimedia.org/r/566288 (https://phabricator.wikimedia.org/T242579) (owner: Ema)
[14:40:55] (PS1) Filippo Giunchedi: install_server: switch wtp/weblog to standard partman recipes [puppet] - https://gerrit.wikimedia.org/r/566293 (https://phabricator.wikimedia.org/T156955)
[14:41:37] (PS5) Vgutierrez: Release 2.0.91-2wm [software/varnish/libvmod-tbf] (debian) - https://gerrit.wikimedia.org/r/565550 (https://phabricator.wikimedia.org/T242093)
[14:42:07] (CR) Vgutierrez: Release 2.0.91-2wm (1 comment) [software/varnish/libvmod-tbf] (debian) - https://gerrit.wikimedia.org/r/565550 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[14:42:35] (PS4) Ema: netconsole: add server to esams ganeti [puppet] - https://gerrit.wikimedia.org/r/566288 (https://phabricator.wikimedia.org/T242579)
[14:42:50] (CR) Giuseppe Lavagetto: [C: +2] conftool: remove unused services from kubernetes [puppet] - https://gerrit.wikimedia.org/r/566077 (owner: Giuseppe Lavagetto)
[14:42:59]
(CR) Ema: [C: +1] Release 2.0.91-2wm [software/varnish/libvmod-tbf] (debian) - https://gerrit.wikimedia.org/r/565550 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[14:43:16] (PS4) Giuseppe Lavagetto: conftool: remove unused services from kubernetes [puppet] - https://gerrit.wikimedia.org/r/566077
[14:43:18] (CR) Vgutierrez: [C: +2] Release 2.0.91-2wm [software/varnish/libvmod-tbf] (debian) - https://gerrit.wikimedia.org/r/565550 (https://phabricator.wikimedia.org/T242093) (owner: Vgutierrez)
[14:45:35] (CR) Muehlenhoff: "TTBOMK the plan for oresrdb is to simply remove them in favour of Ores using the main rdb* cluster, as such we could also simply drop them" [puppet] - https://gerrit.wikimedia.org/r/566290 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[14:47:48] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/566293 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[14:52:17] (CR) Vgutierrez: [C: +1] netconsole: add server to esams ganeti [puppet] - https://gerrit.wikimedia.org/r/566288 (https://phabricator.wikimedia.org/T242579) (owner: Ema)
[14:53:17] (CR) Ema: [C: +2] netconsole: add server to esams ganeti [puppet] - https://gerrit.wikimedia.org/r/566288 (https://phabricator.wikimedia.org/T242579) (owner: Ema)
[14:55:59] (CR) Ottomata: "> Is there a specific reason to diverge from the wmf.releasename, wmf.chartname etc?" [deployment-charts] - https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) (owner: Ottomata)
[14:56:31] !log uploaded libvmod-netmapper 1.7-3 to apt.w.o (buster) - T242093
[14:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:34] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093
[14:57:09] (CR) Ammarpad: "Yes, I know, and in fact I was there for the first two.
I missed the third one." [mediawiki-config] - https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: Ammarpad)
[14:57:22] !log uploaded libvmod-re2 1.3.1-3 to apt.w.o (buster) - T242093
[14:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:38] Operations, Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep - https://phabricator.wikimedia.org/T243226 (jbond) >>! In T243226#5817998, @Krenair wrote: > am guessing this is just us needing to get a new puppetmaster with buster instead of stretch I added the puppet-master version 5 packa...
[15:01:48] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/566289 (owner: Filippo Giunchedi)
[15:01:52] (CR) Muehlenhoff: "Looks good, but two comments:" [puppet] - https://gerrit.wikimedia.org/r/566291 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[15:02:22] PROBLEM - Check systemd state on cp3055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:45] looking
[15:06:27] (CR) Alexandros Kosiaris: [C: -1] "Ah, it's due to a5f0dcc19fe03884c2930760c15c1cfb60fd8502, hmm" [deployment-charts] - https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) (owner: Ottomata)
[15:15:18] RECOVERY - Check systemd state on cp3055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:22:20] (PS13) Ottomata: New eventstreams chart [deployment-charts] - https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658)
[15:31:10] (CR) Ayounsi: [V: +2 C: +2] Add RPKI whitelist support (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/565771 (owner: Ayounsi)
[15:31:55] Operations, Traffic: Setup netconsole on upload@esams hosts - https://phabricator.wikimedia.org/T242579 (ema) Open→Resolved a:ema This is now done in prod. All upload@esams nodes are sending their kernel messages to a central host. See `journalctl -u netconsole` on ganeti3002.
[15:31:59] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (ema)
[15:32:35] (CR) Filippo Giunchedi: [C: +2] base: display disabled message when check_puppetrun is in grace period [puppet] - https://gerrit.wikimedia.org/r/566289 (owner: Filippo Giunchedi)
[15:32:41] Operations, ops-codfw, DBA: db2085 crashed - memory issues - https://phabricator.wikimedia.org/T243148 (Papaul) p:Triage→Normal
[15:35:16] (CR) Elukey: "A couple of notes:" [puppet] - https://gerrit.wikimedia.org/r/566237 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[15:44:25] (Abandoned) Giuseppe Lavagetto: systemd: remove references to hhvm in the tests [puppet] - https://gerrit.wikimedia.org/r/544844 (owner: Giuseppe Lavagetto)
[15:45:22] (Abandoned) Giuseppe Lavagetto: lvs::configuration: do not assume port 80 by default [puppet] - https://gerrit.wikimedia.org/r/545525 (owner: Giuseppe Lavagetto)
[15:48:17] (CR) Vgutierrez: [C: +1] service::catalog: fix host header for check_http* monitoring [puppet] - https://gerrit.wikimedia.org/r/566250 (owner: Giuseppe Lavagetto)
[15:49:57] (PS2) Andrew Bogott: Remove hieradata/common/openstack.yaml [puppet] - https://gerrit.wikimedia.org/r/564662
[15:50:43] (PS1) Giuseppe Lavagetto: wmflib::service: add get_pool_nodes function [puppet] - https://gerrit.wikimedia.org/r/566300
[15:51:57] (PS1) Tchanders: Remove partial blocks banner from all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/566302 (https://phabricator.wikimedia.org/T240300)
[15:54:40] (CR) Andrew Bogott: [C: +2] Remove hieradata/common/openstack.yaml [puppet] - https://gerrit.wikimedia.org/r/564662 (owner: Andrew Bogott)
[15:56:25] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) @jijiki I have no problem with that.
[15:56:45] (PS2) Andrew Bogott: make-instance-vg: check if lvm is present before we start [puppet] - https://gerrit.wikimedia.org/r/564847 (https://phabricator.wikimedia.org/T241868)
[15:57:40] (CR) Filippo Giunchedi: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/566291 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[15:58:36] (CR) Andrew Bogott: [C: +2] make-instance-vg: check if lvm is present before we start [puppet] - https://gerrit.wikimedia.org/r/564847 (https://phabricator.wikimedia.org/T241868) (owner: Andrew Bogott)
[15:59:51] (PS2) Giuseppe Lavagetto: wmflib::service: add get_pool_nodes function [puppet] - https://gerrit.wikimedia.org/r/566300
[16:00:03] Operations, ORES, Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (Halfak) @akosiaris, I'm at a loss here. I can't figure out what might have happened and I am struggling to reproduce the behavior. What do you think a good next step might...
[16:02:15] (CR) Muehlenhoff: "Ack, if those are moved to /srv, we don't need to bother with delaycompress."
[puppet] - https://gerrit.wikimedia.org/r/566291 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[16:02:31] !log uploaded libvmod-tbf 2.0.91-2wm to apt.w.o (buster) - T242093
[16:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:35] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093
[16:02:44] (PS3) Giuseppe Lavagetto: wmflib::service: add get_pool_nodes function [puppet] - https://gerrit.wikimedia.org/r/566300
[16:03:16] (CR) CDanis: [C: +2] ripeatlas alerts: link to the grafana dashboard too [puppet] - https://gerrit.wikimedia.org/r/565060 (owner: CDanis)
[16:03:18] (CR) Ayounsi: [C: +1] ripeatlas alerts: link to the grafana dashboard too [puppet] - https://gerrit.wikimedia.org/r/565060 (owner: CDanis)
[16:03:25] (PS3) Andrew Bogott: mwopenstackclients: use keystoneauth1 sessions [puppet] - https://gerrit.wikimedia.org/r/565457
[16:04:37] (CR) Andrew Bogott: [C: +2] mwopenstackclients: use keystoneauth1 sessions [puppet] - https://gerrit.wikimedia.org/r/565457 (owner: Andrew Bogott)
[16:05:11] (PS4) Giuseppe Lavagetto: wmflib::service: add get_pool_nodes function [puppet] - https://gerrit.wikimedia.org/r/566300
[16:06:57] (PS1) Filippo Giunchedi: swift: move logs to /srv/log [puppet] - https://gerrit.wikimedia.org/r/566303
[16:07:44] Operations, Traffic, netops: esams ipv6 reachability degraded - https://phabricator.wikimedia.org/T243127 (CDanis) Open→Resolved a:CDanis This has recovered, but at its worst point we were down to just barely above 93% of probes with ipv6 reachability to esams.
[16:07:57] (PS1) Alexandros Kosiaris: eventgate: Remove extraneous comment in _helpers.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/566304
[16:08:23] (CR) Dmaza: [C: +1] Remove partial blocks banner from all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/566302 (https://phabricator.wikimedia.org/T240300) (owner: Tchanders)
[16:08:34] Operations, ops-esams, DC-Ops, Traffic: Upgrade BIOS and IDRAC firmware on esams caches - https://phabricator.wikimedia.org/T243167 (wiki_willy) a:RobH
[16:08:36] (CR) Ottomata: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/566237 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[16:08:54] (CR) Filippo Giunchedi: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/566291 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[16:09:27] (CR) Ottomata: [C: +2] "Uh, dunno why that's there! :)" [deployment-charts] - https://gerrit.wikimedia.org/r/566304 (owner: Alexandros Kosiaris)
[16:09:47] Operations, DBA: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (jcrespo) a:jcrespo
[16:11:13] (CR) Elukey: [C: +1] Switch some analytics roles to standard Partman recipes [puppet] - https://gerrit.wikimedia.org/r/566237 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[16:13:35] Operations, ops-esams, DC-Ops, Traffic: Upgrade BIOS and IDRAC firmware on esams caches - https://phabricator.wikimedia.org/T243167 (wiki_willy) @RobH - can you work with the traffic to get the bios upgraded on the cp hosts in esams? In T240177, @Papaul found a Dell bulletin that associated wit...
[16:13:58] (PS2) Filippo Giunchedi: install_server: switch ms-fe to standard partman recipes [puppet] - https://gerrit.wikimedia.org/r/566291 (https://phabricator.wikimedia.org/T156955)
[16:14:00] (PS2) Filippo Giunchedi: install_server: switch wtp/weblog to standard partman recipes [puppet] - https://gerrit.wikimedia.org/r/566293 (https://phabricator.wikimedia.org/T156955)
[16:15:43] !log copied prometheus-varnishkafka-exporter from stretch to buster on apt.w.o - T242093
[16:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:46] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093
[16:18:12] (CR) Ottomata: New eventstreams chart (9 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) (owner: Ottomata)
[16:18:39] (CR) Andrew Bogott: [C: +2] wmcs-dns-floating-ip-updater.py: move to python3 [puppet] - https://gerrit.wikimedia.org/r/565458 (https://phabricator.wikimedia.org/T229920) (owner: Andrew Bogott)
[16:19:22] (PS14) Ottomata: New eventstreams chart [deployment-charts] - https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658)
[16:19:53] (PS1) Elukey: aptrepo: update bigtop's allowed packages filter [puppet] - https://gerrit.wikimedia.org/r/566306
[16:21:18] (PS3) Andrew Bogott: Remove old unused OpenStackManager variables [puppet] - https://gerrit.wikimedia.org/r/565099 (owner: Reedy)
[16:21:31] (PS15) Ottomata: New eventstreams chart [deployment-charts] - https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658)
[16:21:39] (CR) CRusnov: "This change is ready for review."
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/565176 (owner: 10CRusnov) [16:21:52] (03Abandoned) 10CRusnov: puppetdb: fix pep8 complaint [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/565169 (owner: 10CRusnov) [16:22:43] (03CR) 10Andrew Bogott: [C: 03+2] Remove old unused OpenStackManager variables [puppet] - 10https://gerrit.wikimedia.org/r/565099 (owner: 10Reedy) [16:24:52] (03CR) 10CRusnov: [C: 03+1] "Seems a reasonable change." [software/spicerack] - 10https://gerrit.wikimedia.org/r/566053 (owner: 10Volans) [16:26:17] (03CR) 10CRusnov: [C: 03+1] "LGTM thanks" [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553741 (owner: 10Hashar) [16:26:41] (03PS5) 10Giuseppe Lavagetto: wmflib::service: add get_pool_nodes function [puppet] - 10https://gerrit.wikimedia.org/r/566300 [16:27:32] (03CR) 10Elukey: [C: 03+2] aptrepo: update bigtop's allowed packages filter [puppet] - 10https://gerrit.wikimedia.org/r/566306 (owner: 10Elukey) [16:30:01] (03CR) 10Gehel: [C: 03+2] airflow: Provide runtime directory for skein [puppet] - 10https://gerrit.wikimedia.org/r/565646 (owner: 10EBernhardson) [16:30:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/20489/ this is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/566300 (owner: 10Giuseppe Lavagetto) [16:31:49] (03PS1) 10Vgutierrez: Release 8.0.5-1wm13 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/566309 (https://phabricator.wikimedia.org/T242093) [16:32:01] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm13 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/566309 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [16:32:32] (03PS4) 10Andrew Bogott: CloudVPS: codfw1dev: Fix default SSH rule to use correct range [puppet] - 10https://gerrit.wikimedia.org/r/565431 (https://phabricator.wikimedia.org/T229441) (owner: 10Alex Monk) [16:34:47] (03PS1) 10Giuseppe Lavagetto: mediawiki::php::restarts: count nodes only once 
[puppet] - 10https://gerrit.wikimedia.org/r/566310 [16:35:06] (03CR) 10Andrew Bogott: [C: 03+2] CloudVPS: codfw1dev: Fix default SSH rule to use correct range [puppet] - 10https://gerrit.wikimedia.org/r/565431 (https://phabricator.wikimedia.org/T229441) (owner: 10Alex Monk) [16:35:10] (03CR) 10Bstorm: [RFC] kubernetes: add support for multiple objects of any kind (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/566045 (https://phabricator.wikimedia.org/T156626) (owner: 10Arturo Borrero Gonzalez) [16:35:31] !log install software upgrade on pfw3b-eqiad (secondary, no restart yet) [16:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:40] Jeff_Green: ^ [16:37:17] XioNoX: woot [16:37:28] (03CR) 10Bstorm: [RFC] kubernetes: add support for multiple objects of any kind (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/566045 (https://phabricator.wikimedia.org/T156626) (owner: 10Arturo Borrero Gonzalez) [16:37:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php::restarts: count nodes only once [puppet] - 10https://gerrit.wikimedia.org/r/566310 (owner: 10Giuseppe Lavagetto) [16:38:08] (03PS2) 10Giuseppe Lavagetto: mediawiki::php::restarts: count nodes only once [puppet] - 10https://gerrit.wikimedia.org/r/566310 [16:40:07] (03PS1) 10Ayounsi: Add read-only user dwisehaupt [homer/public] - 10https://gerrit.wikimedia.org/r/566312 (https://phabricator.wikimedia.org/T242758) [16:40:45] (03PS2) 10Vgutierrez: Release 8.0.5-1wm13 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/566309 (https://phabricator.wikimedia.org/T242093) [16:40:57] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm13 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/566309 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [16:45:11] !log install software upgrade on pfw3a-eqiad (primary, no restart yet) [16:45:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:11] (03PS3) 10Vgutierrez: Release 8.0.5-1wm13 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/566309 (https://phabricator.wikimedia.org/T242093) [16:47:23] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm13 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/566309 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [16:47:46] (03CR) 10BryanDavis: [RFC] kubernetes: add support for multiple objects of any kind (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/566045 (https://phabricator.wikimedia.org/T156626) (owner: 10Arturo Borrero Gonzalez) [16:48:06] 10Operations, 10TechCom-RFC, 10Traffic, 10Core Platform Team Legacy (Designing), 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906 (10chasemp) 05Open→03Declined L... [16:48:12] (03PS1) 10Giuseppe Lavagetto: scap::source: use wmflib::service::fetch [puppet] - 10https://gerrit.wikimedia.org/r/566314 [16:49:10] (03CR) 10Volans: [C: 03+2] tests: simplify and centralize skipif markers [software/spicerack] - 10https://gerrit.wikimedia.org/r/566053 (owner: 10Volans) [16:50:12] (03CR) 10Andrew Bogott: [C: 03+2] "thanks for this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/565431 (https://phabricator.wikimedia.org/T229441) (owner: 10Alex Monk) [16:50:19] (03PS2) 10Giuseppe Lavagetto: scap::source: use wmflib::service::fetch [puppet] - 10https://gerrit.wikimedia.org/r/566314 [16:50:38] (03PS4) 10Arturo Borrero Gonzalez: [RFC] kubernetes: add support for multiple objects of any kind [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/566045 (https://phabricator.wikimedia.org/T156626) [16:51:18] (03PS5) 10Arturo Borrero Gonzalez: kubernetes: add support for multiple objects of any kind [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/566045 (https://phabricator.wikimedia.org/T156626) [16:53:29] (03Merged) 10jenkins-bot: tests: simplify and centralize skipif markers [software/spicerack] - 10https://gerrit.wikimedia.org/r/566053 (owner: 10Volans) [16:53:44] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on esams caches - https://phabricator.wikimedia.org/T243167 (10RobH) a:05RobH→03BBlack Please note that #traffic (and @bblack) previously asked me NOT to do this on these hosts, while they worked out why they are crashing.... [16:55:11] (03CR) 10Bstorm: [C: 03+1] "I think this would do what we'd need for multi-object deployments and ingresses (as well as cleaning up mixed up tools)." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/566045 (https://phabricator.wikimedia.org/T156626) (owner: 10Arturo Borrero Gonzalez) [16:56:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/20491/deploy1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/566314 (owner: 10Giuseppe Lavagetto) [16:57:29] Jeff_Green: ready to reboot both pfw3-eqiad members? 
[16:58:00] please hold a minute, I'm setting stuff into maintenance mode to reduce chaos [16:59:01] (03PS2) 10Bstorm: dumps-distribution: switch to sharing out only the pertinent dir on nfs [puppet] - 10https://gerrit.wikimedia.org/r/565405 (https://phabricator.wikimedia.org/T242798) [16:59:13] yep, let me know when good to go, I downtimed the firewall in icinga [16:59:21] ok [17:00:04] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200121T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:01:28] 10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.wikimedia.org - https://phabricator.wikimedia.org/T241337 (10Papaul) [17:02:06] (03CR) 10Bstorm: [C: 03+2] dumps-distribution: switch to sharing out only the pertinent dir on nfs [puppet] - 10https://gerrit.wikimedia.org/r/565405 (https://phabricator.wikimedia.org/T242798) (owner: 10Bstorm) [17:02:09] (03CR) 10Andrew Bogott: "I question about these patches: are there distros (future or present) where #!/usr/bin/python will launch python3 instead of 2?" 
[puppet] - 10https://gerrit.wikimedia.org/r/565793 (https://phabricator.wikimedia.org/T229920) (owner: 10Legoktm) [17:02:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] dumps-distribution: switch to sharing out only the pertinent dir on nfs [puppet] - 10https://gerrit.wikimedia.org/r/565405 (https://phabricator.wikimedia.org/T242798) (owner: 10Bstorm) [17:02:49] (03PS4) 10Vgutierrez: Release 8.0.5-1wm13 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/566309 (https://phabricator.wikimedia.org/T242093) [17:03:07] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm13 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/566309 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [17:04:12] (03CR) 10Lucas Werkmeister (WMDE): "> I question about these patches: are there distros (future or present) where #!/usr/bin/python will launch python3 instead of 2?" [puppet] - 10https://gerrit.wikimedia.org/r/565793 (https://phabricator.wikimedia.org/T229920) (owner: 10Legoktm) [17:04:53] (03PS5) 10Vgutierrez: Release 8.0.5-1wm13 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/566309 (https://phabricator.wikimedia.org/T242093) [17:05:04] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm13 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/566309 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [17:06:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/565793 (https://phabricator.wikimedia.org/T229920) (owner: 10Legoktm) [17:06:53] 10Operations, 10ops-codfw: codfw: rack/setup/install wdqs202[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (10Papaul) [17:07:20] (03CR) 10Andrew Bogott: "> I'm not sure either if it will point to python3 or will simply stop working." [puppet] - 10https://gerrit.wikimedia.org/r/565793 (https://phabricator.wikimedia.org/T229920) (owner: 10Legoktm) [17:07:57] XioNoX: OK you're good to start [17:08:09] alright! 
[17:08:38] !log restart pfw3-eqiad for software upgrade [17:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:04] Uptime: 846d23h22m21s [17:10:11] eep [17:12:29] (03CR) 10Muehlenhoff: "There are no plans for Python 3 in Debian to overtake the python name, in fact in unstable code which explicitly continues to use Py2 for " [puppet] - 10https://gerrit.wikimedia.org/r/565793 (https://phabricator.wikimedia.org/T229920) (owner: 10Legoktm) [17:12:47] so far everything is booting up properly on console links [17:16:33] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:16:44] that's expected ^ [17:16:57] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:19:44] boot process still going as expected [17:20:19] !log starting branch cut for T233864 [17:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:23] T233864: 1.35.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T233864 [17:21:27] Jeff_Green: host is up, checking [17:21:47] ok [17:22:44] up on the console, but some processes are still booting up [17:22:51] PROBLEM - Host checker.tools.wmflabs.org is DOWN: CRITICAL - Host Unreachable (checker.tools.wmflabs.org) [17:23:27] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [17:23:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:24:17] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, 
interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:25:24] alright, interfaces up, and the host now pings [17:25:39] cluster is healthy [17:26:22] ipsec still down [17:26:32] 10Operations, 10ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (10Cmjohnson) 05Open→03Resolved Replaced the CMOS battery, reset the date/time in bios. resolving [17:27:32] Jeff_Green: alright, everything looks good to me [17:28:06] XioNoX: ok, so far things look normal to me too [17:28:21] exactly 20min [17:28:48] ++ [17:32:48] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@ae77f9d]: Deploy ores_drafttopics dag [17:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:11] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@ae77f9d]: Deploy ores_drafttopics dag (duration: 00m 22s) [17:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:16] (03CR) 10Volans: [C: 04-1] "Some specific comments inline." (036 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/562408 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov) [17:36:39] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10akosiaris) Thanks for the ping. Notes: * This is also evident at https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=codfw%20prometheus%2Fops&v... 
[17:37:18] !log re-exported NFS from labstore1006/7 [17:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:36] (03PS1) 10Ayounsi: Extend PRODUCT_NAMES_IGNORE [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/566321 (https://phabricator.wikimedia.org/T213843) [17:39:07] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@986769c]: bulk_daemon: Treat model exists as unrecoverable failure [17:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:17] (03CR) 10Ayounsi: "Not tested." [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/566321 (https://phabricator.wikimedia.org/T213843) (owner: 10Ayounsi) [17:42:30] (03PS1) 10DannyS712: Fix typos (boostrap -> bootstrap) [puppet] - 10https://gerrit.wikimedia.org/r/566323 (https://phabricator.wikimedia.org/T201491) [17:43:49] (03PS2) 10DannyS712: Fix typos (boostrap -> bootstrap) [puppet] - 10https://gerrit.wikimedia.org/r/566323 (https://phabricator.wikimedia.org/T201491) [17:44:49] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@986769c]: bulk_daemon: Treat model exists as unrecoverable failure (duration: 05m 42s) [17:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:24] !log add dwisehaupt user to pfw/fasw - T242758 [17:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:41] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:49:31] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:50:31] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add read-only user dwisehaupt [homer/public] - 10https://gerrit.wikimedia.org/r/566312 (https://phabricator.wikimedia.org/T242758) (owner: 10Ayounsi) [18:00:04] cscott, arlolra, subbu, halfak, and accraze: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200121T1800). [18:02:02] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Cmjohnson) [18:03:02] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Cmjohnson) these are cabled, running into an issue on the switch cfg removing from disabled vlan. connected fasw1-c ge-0/0/0 (eth 1) and fasw2-c ge-1/0/0 (e... 
[18:04:22] (03PS1) 10Brennen Bearnes: Group0 to 1.35.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566330 [18:19:51] !log brennen@deploy1001 Started scap: testwiki to php-1.35.0-wmf.16 and rebuild l10n cache [18:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:14] (03CR) 10EBernhardson: [cirrus] A/B test for MLR models (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559614 (https://phabricator.wikimedia.org/T219534) (owner: 10Mstyles) [18:29:23] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Cmjohnson) [18:30:17] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Cmjohnson) a:05Cmjohnson→03Jgreen [18:30:40] (03PS3) 10Dzahn: remove production IPs for phab1003 [dns] - 10https://gerrit.wikimedia.org/r/552601 (https://phabricator.wikimedia.org/T238957) [18:30:51] (03PS2) 10Ayounsi: Extend PRODUCT_NAMES_IGNORE [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/566321 (https://phabricator.wikimedia.org/T213843) [18:30:53] 10Operations, 10fundraising-tech-ops: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Cmjohnson) [18:34:38] (03CR) 10Dzahn: [C: 03+2] remove production IPs for phab1003 [dns] - 10https://gerrit.wikimedia.org/r/552601 (https://phabricator.wikimedia.org/T238957) (owner: 10Dzahn) [18:35:04] (03CR) 10Dzahn: [C: 03+2] "mysql grants have been removed" [dns] - 10https://gerrit.wikimedia.org/r/552601 (https://phabricator.wikimedia.org/T238957) (owner: 10Dzahn) [18:36:17] ACKNOWLEDGEMENT - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 4 days ago: Most recent backup 2020-01-17 08:53:02 Jcrespo retrying prepare backup https://wikitech.wikimedia.org/wiki/MariaDB/Backups [18:40:19] 
10Operations: 2020 Q3 DC switchover and switchback - https://phabricator.wikimedia.org/T243314 (10RLazarus) [18:40:46] 10Operations, 10DC-Ops, 10netops, 10Patch-For-Review: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10ayounsi) If we do abstraction of all the power supplies (that the CR above is for), there are still inconsistencies, but the list is progressively shrinking. Some ar... [18:41:14] 10Operations, 10Performance-Team, 10serviceops: Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (10aaron) What user impact did it cause? [18:42:09] 10Operations: 2020 Q3 eqiad -> codfw switchover - https://phabricator.wikimedia.org/T243316 (10RLazarus) [18:42:17] (03PS1) 10Andrew Bogott: labstore2003/2004: change to role(spare::system) [puppet] - 10https://gerrit.wikimedia.org/r/566334 (https://phabricator.wikimedia.org/T239884) [18:43:06] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (10Jgreen) [18:43:16] (03CR) 10Andrew Bogott: [C: 03+2] labstore2003/2004: change to role(spare::system) [puppet] - 10https://gerrit.wikimedia.org/r/566334 (https://phabricator.wikimedia.org/T239884) (owner: 10Andrew Bogott) [18:43:54] 10Operations: 2020 Q3 (or later) codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10RLazarus) [18:47:10] 10Operations, 10DC-Ops, 10decommission: decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (10Andrew) [18:47:41] 10Operations, 10DC-Ops, 10decommission: decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (10Andrew) [18:50:18] !log brennen@deploy1001 Finished scap: testwiki to php-1.35.0-wmf.16 and rebuild l10n cache (duration: 30m 27s) [18:50:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:15] 10Operations, 10DC-Ops, 10decommission: decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (10Andrew) [18:58:23] (03CR) 10Paladox: [C: 03+1] ferm_misc/db: allow connections from gerrit1002 in ferm [puppet] - 10https://gerrit.wikimedia.org/r/562965 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [18:59:40] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission [18:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200121T1900) [19:00:40] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [19:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:44] 10Operations, 10DC-Ops, 10decommission: decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `labstore2003.codfw.wmnet` - labstore2003.codfw.wmnet (**PASS... 
[19:00:55] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission [19:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:29] 10Operations, 10DC-Ops, 10decommission: decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (10Andrew) [19:01:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [19:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:02] 10Operations, 10DC-Ops, 10decommission: decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `labstore2004.codfw.wmnet` - labstore2004.codfw.wmnet (**PASS... [19:03:24] (03PS1) 10Andrew Bogott: Remove references to labstore2003/2004 [puppet] - 10https://gerrit.wikimedia.org/r/566339 (https://phabricator.wikimedia.org/T243319) [19:05:09] (03PS1) 10Andrew Bogott: remove refs to labstore2003/2004 [dns] - 10https://gerrit.wikimedia.org/r/566341 (https://phabricator.wikimedia.org/T243319) [19:05:17] (03CR) 10Andrew Bogott: [C: 03+2] Remove references to labstore2003/2004 [puppet] - 10https://gerrit.wikimedia.org/r/566339 (https://phabricator.wikimedia.org/T243319) (owner: 10Andrew Bogott) [19:06:08] (03CR) 10Andrew Bogott: [C: 03+2] remove refs to labstore2003/2004 [dns] - 10https://gerrit.wikimedia.org/r/566341 (https://phabricator.wikimedia.org/T243319) (owner: 10Andrew Bogott) [19:07:13] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (10Andrew) a:05Andrew→03Papaul [19:07:48] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (10Andrew) @papaul, I'm not positive but 
I think these servers have disk shelves attached. If so those shelves can also be decom'd at... [19:16:42] 10Operations, 10Wikimedia-Etherpad, 10serviceops: Migrate etherpad1001 to Buster - https://phabricator.wikimedia.org/T224580 (10Dzahn) The following packages are used by the puppet role but so far missing on buster: * prometheus-etherpad-exporter * etherpad-lite [19:19:53] 10Operations, 10netops: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659 (10ayounsi) From JTAC: ` Been checking this issue with one of my seniors, on MX204 cr3-knams can we set on the below command: set routing-options ppm no-delegate-processing And cofig... [19:22:20] !log cr3-knams# set routing-options ppm no-delegate-processing - T240659 [19:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:23] T240659: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659 [19:26:12] (03CR) 10Dzahn: [C: 03+2] acme_chief/gerrit: remove gerrit-new, add gerrit1002 [puppet] - 10https://gerrit.wikimedia.org/r/565716 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [19:26:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10bd808) [19:26:31] 10Operations, 10DC-Ops, 10decommission, 10cloud-services-team (Hardware): decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (10bd808) [19:30:41] 10Operations, 10netops: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659 (10ayounsi) Note that the above probably reset the sessions, as they are now up. Leaving it configured until the issue happen again. 
[19:32:30] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Andrew) [19:33:07] (03PS1) 10Jhedden: openstack: depool cloudvirt1007 for ceph testing [puppet] - 10https://gerrit.wikimedia.org/r/566345 (https://phabricator.wikimedia.org/T243327) [19:33:13] (03PS4) 10Dzahn: gerrit: assign host gerrit1002 role::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/562587 (https://phabricator.wikimedia.org/T239151) (owner: 10Herron) [19:33:50] (03CR) 10Paladox: [C: 03+1] gerrit: assign host gerrit1002 role::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/562587 (https://phabricator.wikimedia.org/T239151) (owner: 10Herron) [19:35:53] 10Operations, 10DC-Ops, 10decommission: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (10Andrew) [19:36:04] (03PS5) 10Paladox: gerrit: assign host gerrit1002 role::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/562587 (https://phabricator.wikimedia.org/T239151) (owner: 10Herron) [19:36:16] (03PS6) 10Dzahn: gerrit: assign host gerrit1002 role::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/562587 (https://phabricator.wikimedia.org/T239151) (owner: 10Herron) [19:36:28] (03CR) 10Andrew Bogott: [C: 03+1] openstack: depool cloudvirt1007 for ceph testing [puppet] - 10https://gerrit.wikimedia.org/r/566345 (https://phabricator.wikimedia.org/T243327) (owner: 10Jhedden) [19:36:51] (03CR) 10Jhedden: [C: 03+2] openstack: depool cloudvirt1007 for ceph testing [puppet] - 10https://gerrit.wikimedia.org/r/566345 (https://phabricator.wikimedia.org/T243327) (owner: 10Jhedden) [19:38:59] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission [19:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:41] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [19:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:44] (03PS3) 10Aaron 
Schulz: Use GTIDs for master position queries for external DB when possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 [19:39:44] 10Operations, 10DC-Ops, 10decommission: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `labstore2001.codfw.wmnet` - labstore2001.codfw.wmnet (**PASS... [19:39:48] !log mr1-esams> request system software add /var/tmp/junos-srxsme-18.2R3-S2... - T242097 [19:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:51] T242097: mr1-esams RMA (2020 edition) - https://phabricator.wikimedia.org/T242097 [19:39:55] 10Operations, 10DC-Ops, 10decommission: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (10Andrew) [19:39:55] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission [19:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:33] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [19:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:36] 10Operations, 10DC-Ops, 10decommission: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `labstore2002.codfw.wmnet` - labstore2002.codfw.wmnet (**PASS... 
[19:41:27] (03PS1) 10Andrew Bogott: Remove references to labstore2001 and labstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/566347 (https://phabricator.wikimedia.org/T243329) [19:44:11] (03PS1) 10Andrew Bogott: Remove references to labstore2001 and labstore2002 [dns] - 10https://gerrit.wikimedia.org/r/566348 (https://phabricator.wikimedia.org/T243329) [19:44:22] (03CR) 10Andrew Bogott: [C: 03+2] Remove references to labstore2001 and labstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/566347 (https://phabricator.wikimedia.org/T243329) (owner: 10Andrew Bogott) [19:44:45] (03CR) 10Andrew Bogott: [C: 03+2] Remove references to labstore2001 and labstore2002 [dns] - 10https://gerrit.wikimedia.org/r/566348 (https://phabricator.wikimedia.org/T243329) (owner: 10Andrew Bogott) [19:45:00] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@1ca3071]: Add separate rule for machine vision jobs T241072 [19:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:03] T241072: Update the "waiting period" implemenation so as not to block the job queue - https://phabricator.wikimedia.org/T241072 [19:45:42] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (10Andrew) [19:46:11] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@1ca3071]: Add separate rule for machine vision jobs T241072 (duration: 01m 11s) [19:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:13] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (10Andrew) a:03Papaul [19:46:59] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (10Andrew) @papaul, please decom 
associated disk shelves when you pull these servers. Thank you! [19:47:14] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Andrew) [19:55:27] 10Operations, 10ops-eqiad, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10Cmjohnson) [19:55:51] !log restart mr1-esams for software upgrade - T242097 [19:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:54] T242097: mr1-esams RMA (2020 edition) - https://phabricator.wikimedia.org/T242097 [19:58:34] PROBLEM - Host asw2-esams.mgmt.esams.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [19:59:42] !log puppet-compilers: syncing facts from puppetmasters to 3 compiler instances [19:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] brennen and twentyafterfour: How many deployers does it take to do Mediawiki train - American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200121T2000). 
[20:00:24] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:00:44] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:01:30] ^ expected too
[20:01:30] Operations, ops-eqiad, serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (Cmjohnson)
[20:01:47] (PS2) Dzahn: switch filtertags labs-project-phabricator to labs-project-devtools [puppet] - https://gerrit.wikimedia.org/r/565703
[20:02:01] (CR) Brennen Bearnes: [C: +2] Group0 to 1.35.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/566330 (owner: Brennen Bearnes)
[20:03:00] (Merged) jenkins-bot: Group0 to 1.35.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/566330 (owner: Brennen Bearnes)
[20:03:13] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:03:22] RECOVERY - Host asw2-esams.mgmt.esams.wmnet is UP: PING OK - Packet loss = 0%, RTA = 86.43 ms
[20:03:34] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:03:38] (PS1) Andrew Bogott: codfw1dev: change the proxyagent password to be the same as eqiad1 [puppet] - https://gerrit.wikimedia.org/r/566351
[20:05:24] (CR) Andrew Bogott: [C: +2] codfw1dev: change the proxyagent password to be the same as eqiad1 [puppet] - https://gerrit.wikimedia.org/r/566351 (owner: Andrew Bogott)
[20:07:19] (PS1) Andrew Bogott: labtest-instances: change proxyuser password to be the same as eqiad1 [labs/private] - https://gerrit.wikimedia.org/r/566353
[20:07:53] (CR) Andrew Bogott: [V: +2 C: +2] labtest-instances: change proxyuser password to be
the same as eqiad1 [labs/private] - https://gerrit.wikimedia.org/r/566353 (owner: Andrew Bogott)
[20:08:15] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/20495/" [puppet] - https://gerrit.wikimedia.org/r/562587 (https://phabricator.wikimedia.org/T239151) (owner: Herron)
[20:08:30] (CR) Dzahn: [C: +2] switch filtertags labs-project-phabricator to labs-project-devtools [puppet] - https://gerrit.wikimedia.org/r/565703 (owner: Dzahn)
[20:09:14] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.16
[20:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:56] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:11:53] (CR) CRusnov: [C: -1] "These changes should be applied to operations/netbox-extras from now on. This repository is deprecated." [software/netbox-reports] - https://gerrit.wikimedia.org/r/566321 (https://phabricator.wikimedia.org/T213843) (owner: Ayounsi)
[20:13:08] (CR) Dzahn: [C: +1] "the important part in the compiler output (which works now after syncing facts to compilers):" [puppet] - https://gerrit.wikimedia.org/r/562587 (https://phabricator.wikimedia.org/T239151) (owner: Herron)
[20:13:21] (PS7) Dzahn: gerrit: assign host gerrit1002 role::gerrit [puppet] - https://gerrit.wikimedia.org/r/562587 (https://phabricator.wikimedia.org/T239151) (owner: Herron)
[20:14:05] (CR) Dzahn: [C: +2] gerrit: assign host gerrit1002 role::gerrit [puppet] - https://gerrit.wikimedia.org/r/562587 (https://phabricator.wikimedia.org/T239151) (owner: Herron)
[20:19:28] Operations, netops: mr1-esams RMA (2020 edition) - https://phabricator.wikimedia.org/T242097 (ayounsi) Errors are still there... I asked for the next recommended steps.
I can imagine it's either another RMA or to try a new optic: https://www.fs.com/products/13272.html
[20:25:35] Operations, Gerrit, vm-requests, Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (Dzahn) https://gerrit.wikimedia.org/r/q/topic:%22gerrit-test%22+(status:open%20OR%20status:merged) https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/gerrit1002...
[20:29:56] Operations, Gerrit, vm-requests, Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (Dzahn) The VM is now usable. It has the `role(gerrit)` on it and no more puppet errors. It uses its own service name/IP: https://gerrit-test.wikimedia.org Shell acces...
[20:31:23] Operations, Gerrit, vm-requests, Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (Dzahn) The mysql user has also been made configurable (along with backups / monitoring) and it is using: ` 104 hostname = m2-master.eqiad.wmnet 105 database...
[20:33:38] Operations, Gerrit, vm-requests, Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (Dzahn) ` 94 heapLimit = 5g 95 slave = false 116 canonicalWebUrl = https://gerrit-test.wikimedia.org/r 218 [sshd] 219 listenAddress = gerrit-test.wi...
[20:34:12] Operations, Gerrit, vm-requests, Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (Dzahn) Open→Resolved a:Dzahn
[20:34:29] (PS3) Dzahn: ferm_misc/db: allow connections from gerrit1002 in ferm [puppet] - https://gerrit.wikimedia.org/r/562965 (https://phabricator.wikimedia.org/T239151)
[20:37:16] Operations, netops: mr1-esams RMA (2020 edition) - https://phabricator.wikimedia.org/T242097 (RobH)
[20:37:47] (PS4) Dzahn: ferm_misc/db: allow connections from gerrit1002 in ferm [puppet] - https://gerrit.wikimedia.org/r/562965 (https://phabricator.wikimedia.org/T239151)
[20:41:54] Operations, Gerrit, vm-requests, Patch-For-Review: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (Dzahn) The gerrit acmechief TLS cert has been updated to contain "gerrit-test" in addition to gerrit and gerrit-replica. The "gerrit-new" name has been removed from it....
[20:41:57] (PS3) Jforrester: Document why we have duplicate false value for wmgEnableGeoData [mediawiki-config] - https://gerrit.wikimedia.org/r/565751 (https://phabricator.wikimedia.org/T183549) (owner: Ammarpad)
[20:42:27] Operations, netops: mr1-esams RMA (2020 edition) - https://phabricator.wikimedia.org/T242097 (RobH) Next steps: * @robh and @wiki_willy to order another optic (or two) via T243335 . * @robh to file remote hands request with iron mountain, for when the optics arrive ** remote hands include swapping exist...
[20:42:30] (CR) Dzahn: [C: +2] ferm_misc/db: allow connections from gerrit1002 in ferm [puppet] - https://gerrit.wikimedia.org/r/562965 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn)
[20:47:41] Operations, netops: mr1-esams i2c syslog flood - https://phabricator.wikimedia.org/T242097 (ayounsi)
[20:48:35] marostegui: do you just need a timeslot for https://gerrit.wikimedia.org/r/525147 ?
[20:52:50] (CR) Dzahn: "> We'll want this same entry in several other places, i.e. all the analytics (stats1007) directories that are fetched. Is this the language" [puppet] - https://gerrit.wikimedia.org/r/565403 (owner: Dzahn)
[20:57:24] (CR) Dzahn: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/565403 (owner: Dzahn)
[21:07:36] (Abandoned) Ayounsi: Extend PRODUCT_NAMES_IGNORE [software/netbox-reports] - https://gerrit.wikimedia.org/r/566321 (https://phabricator.wikimedia.org/T213843) (owner: Ayounsi)
[21:07:45] (PS1) Ayounsi: Extend PRODUCT_NAMES_IGNORE [software/netbox-extras] - https://gerrit.wikimedia.org/r/566361 (https://phabricator.wikimedia.org/T213843)
[21:11:40] (PS1) Dzahn: gerrit: try switching rsync destination for gerrit1002 [puppet] - https://gerrit.wikimedia.org/r/566362
[21:13:40] Operations: Retire the Tor relay - https://phabricator.wikimedia.org/T243288 (Legoktm) > It's an outlier in our technical infrastructure and there is a range of issues with it Are these documented somewhere so that these could be fixed? > such as a lack of data center redundancy, inadequate observability s...
[21:13:56] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/20496/" [puppet] - https://gerrit.wikimedia.org/r/566362 (owner: Dzahn)
[21:16:49] (CR) Ayounsi: "Not tested." [software/netbox-extras] - https://gerrit.wikimedia.org/r/566361 (https://phabricator.wikimedia.org/T213843) (owner: Ayounsi)
[21:17:08] (CR) Paladox: [C: +1] gerrit: try switching rsync destination for gerrit1002 [puppet] - https://gerrit.wikimedia.org/r/566362 (owner: Dzahn)
[21:25:20] Operations, serviceops, Performance-Team (Radar): Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (aaron)
[21:30:47] doesn't seem to be anyone on duty but is it ok to deploy parsoid at this time?
[21:32:19] Operations, ops-eqiad, cloud-services-team (Hardware): Degraded RAID on cloudvirt1013 - https://phabricator.wikimedia.org/T242472 (wiki_willy) a:Jclark-ctr
[21:33:04] (PS1) Bstorm: dumps-distribution: the nfs export for dumps has changed [puppet] - https://gerrit.wikimedia.org/r/566365 (https://phabricator.wikimedia.org/T243328)
[21:34:08] (PS1) Dzahn: gerrit: allow multiple rsync destination hosts in migration class [puppet] - https://gerrit.wikimedia.org/r/566367 (https://phabricator.wikimedia.org/T239151)
[21:36:49] (Abandoned) Dzahn: gerrit: try switching rsync destination for gerrit1002 [puppet] - https://gerrit.wikimedia.org/r/566362 (owner: Dzahn)
[21:37:18] (CR) Dzahn: [C: -1] "https://puppet-compiler.wmflabs.org/compiler1002/20497/gerrit1001.wikimedia.org/change.gerrit1001.wikimedia.org.err" [puppet] - https://gerrit.wikimedia.org/r/566367 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn)
[21:38:09] (PS2) Dzahn: gerrit: allow multiple rsync destination hosts in migration class [puppet] - https://gerrit.wikimedia.org/r/566367 (https://phabricator.wikimedia.org/T239151)
[21:40:54] (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1001/20498/" [puppet] - https://gerrit.wikimedia.org/r/566367 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn)
[21:43:41] (CR) Dzahn: [C: +2] gerrit: allow multiple rsync destination hosts in migration class [puppet] - https://gerrit.wikimedia.org/r/566367 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn)
[21:50:33] (CR) Ottomata: "I'd also be fine with the /v2 URLs, but I think the team argued that URI versioning like that is great for APIs, but is a bit unusual for " [puppet] - https://gerrit.wikimedia.org/r/563508 (https://phabricator.wikimedia.org/T237752) (owner: Elukey)
[21:51:35] (PS2) Bstorm: dumps-distribution: the nfs export for dumps has changed [puppet] -
https://gerrit.wikimedia.org/r/566365 (https://phabricator.wikimedia.org/T243328)
[21:57:25] (CR) Bstorm: "Puppet compiler: https://puppet-compiler.wmflabs.org/compiler1001/20499/notebook1003.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/566365 (https://phabricator.wikimedia.org/T243328) (owner: Bstorm)
[21:57:45] (CR) Ottomata: [C: +1] "Thanks for the symlink stuff!" [puppet] - https://gerrit.wikimedia.org/r/566365 (https://phabricator.wikimedia.org/T243328) (owner: Bstorm)
[21:58:21] (PS3) Bstorm: dumps-distribution: the nfs export for dumps has changed [puppet] - https://gerrit.wikimedia.org/r/566365 (https://phabricator.wikimedia.org/T243328)
[21:59:52] (CR) Bstorm: [C: +2] dumps-distribution: the nfs export for dumps has changed [puppet] - https://gerrit.wikimedia.org/r/566365 (https://phabricator.wikimedia.org/T243328) (owner: Bstorm)
[22:06:46] Operations, hardware-requests: Expand Eqiad Ganeti row_A capacity - https://phabricator.wikimedia.org/T242885 (wiki_willy) a:RobH
[22:08:35] (PS1) Alex Monk: Create role::wmcs::openstack::codfw1dev::puppetmaster::frontend_vm to mirror eqiad1 [puppet] - https://gerrit.wikimedia.org/r/566369
[22:08:43] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:10:30] (CR) jerkins-bot: [V: -1] Create role::wmcs::openstack::codfw1dev::puppetmaster::frontend_vm to mirror eqiad1 [puppet] - https://gerrit.wikimedia.org/r/566369 (owner: Alex Monk)
[22:11:05] Operations, hardware-requests: Expand Eqiad Ganeti row_A capacity - https://phabricator.wikimedia.org/T242885 (RobH) Ok, next steps for this as far as I can tell: DO NOT LIST PRICING ON THIS TASK AS IT IS PUBLIC. * create-sub task in #procurement and price out supported memory upgrade (despite these ho...
[22:12:29] (PS2) Alex Monk: openstack: Create codfw1dev puppetmaster frontend_vm role to mirror eqiad1 [puppet] - https://gerrit.wikimedia.org/r/566369 (https://phabricator.wikimedia.org/T242607)
[22:12:29] Operations, Discovery-Search, Elasticsearch, Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (Mstyles) a:Mathew.onipe→Mstyles
[22:13:29] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:14:14] (CR) Clarakosi: [C: +1] "LGTM but don't have +2 rights" [mediawiki-config] - https://gerrit.wikimedia.org/r/565696 (https://phabricator.wikimedia.org/T243106) (owner: Eevans)
[22:14:40] (CR) Andrew Bogott: [C: +2] openstack: Create codfw1dev puppetmaster frontend_vm role to mirror eqiad1 [puppet] - https://gerrit.wikimedia.org/r/566369 (https://phabricator.wikimedia.org/T242607) (owner: Alex Monk)
[22:15:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:15:22] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (MusikAnimal) I ran into this when I was unable to block https://en.wikipedia.org/w/index.php?target=Printf%28%2...
[22:18:25] Operations, DNS, Domains, Traffic: Donate wikiźródła.pl and wikisłownik.pl to the Foundation - https://phabricator.wikimedia.org/T240446 (Dzahn) @tomasz Is @CRoslof looped into the conversation with Doneva?
[22:23:36] (CR) RLazarus: [C: +1] service::catalog: fix host header for check_http* monitoring [puppet] - https://gerrit.wikimedia.org/r/566250 (owner: Giuseppe Lavagetto)
[22:29:30] Operations, serviceops, Performance-Team (Radar): Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (jijiki) @aaron not at all since it was codfw. On the other hand, we were a bit alarmed because of it, since we didn't expect s...
[22:34:17] (PS1) Dzahn: icinga/mediawiki: update jobrunner monitoring, add command to use a POST request [puppet] - https://gerrit.wikimedia.org/r/566374 (https://phabricator.wikimedia.org/T243096)
[22:34:20] Operations, ops-eqiad, SRE-swift-storage: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T242511 (Cmjohnson) the disk was not delivered... HPE stated: There was some issue in the delivery previously so we have reordered the part. The current ETA is 2020-01-23 10:30 local time.
[22:36:36] Operations, serviceops, Performance-Team (Radar): Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (Krinkle) Looks like the main action is to avoid these alarms in the future, asking a few questions (some may be obvious): * D...
[22:45:02] Operations, LDAP-Access-Requests: Requesting access to wmf LDAP group for dpifke - https://phabricator.wikimedia.org/T243354 (dpifke)
[23:02:41] Operations, Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep - https://phabricator.wikimedia.org/T243226 (Krenair) >>! In T243226#5819512, @jbond wrote: > however the bigger problem is the puppetdb server. this will need to be rebuilt on buster as it is not a simple task to backport the...
[23:02:51] Operations, Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep - https://phabricator.wikimedia.org/T243226 (Krenair) a:jbond→Krenair
[23:08:53] (PS1) Alex Monk: role::puppetmaster::standalone: Add support for multiple PuppetDB hosts [puppet] - https://gerrit.wikimedia.org/r/566380 (https://phabricator.wikimedia.org/T243226)
[23:32:26] (PS1) Dzahn: contint: use package_from_component, stop using docker class [puppet] - https://gerrit.wikimedia.org/r/566383
[23:33:00] (PS2) Dzahn: contint: use package_from_component, stop using docker class [puppet] - https://gerrit.wikimedia.org/r/566383 (https://phabricator.wikimedia.org/T224591)
[23:35:25] PROBLEM - parsoid on scandium is CRITICAL: connect to address 10.64.48.94 and port 8142: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[23:36:55] (PS3) Dzahn: contint: use package_from_component, stop using docker class [puppet] - https://gerrit.wikimedia.org/r/566383 (https://phabricator.wikimedia.org/T224591)
[23:37:13] RECOVERY - parsoid on scandium is OK: HTTP OK: HTTP/1.1 200 OK - 1535 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid
[23:48:45] Operations, Beta-Cluster-Infrastructure, Patch-For-Review: Upgrade puppet in deployment-prep - https://phabricator.wikimedia.org/T243226 (Krenair) Well, my new puppetdb instance seems to not be working very well yet: ` root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production u+14)# pu...
[23:52:47] Operations, Beta-Cluster-Infrastructure, Patch-For-Review: Upgrade puppet in deployment-prep - https://phabricator.wikimedia.org/T243226 (Krenair) rCLIPec5f9c1645d2eadc8db259755bf163c69e0409d6 Reloaded apache2 on deployment-puppetmaster03