[00:39:26] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 546 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:39:38] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 37 probes of 546 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:42:22] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 37 probes of 546 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:08:12] RECOVERY - snapshot of s8 in codfw on db1115 is OK: snapshot for s8 at codfw taken less than 3 days ago and larger than 90 GB: Last one 2020-03-29 21:28:00 from db2100.codfw.wmnet:3318 (1596 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [02:16:14] RECOVERY - snapshot of s8 in eqiad on db1115 is OK: snapshot for s8 at eqiad taken less than 3 days ago and larger than 90 GB: Last one 2020-03-29 21:23:08 from db1116.eqiad.wmnet:3318 (1539 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [02:28:22] RECOVERY - snapshot of s4 in codfw on db1115 is OK: snapshot for s4 at codfw taken less than 3 days ago and larger than 90 GB: Last one 2020-03-29 23:42:04 from db2099.codfw.wmnet:3314 (1159 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [02:44:34] PROBLEM - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - commonswiki_content_1582512959(60.666666666666664gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [02:48:42] RECOVERY - snapshot of s4 in eqiad on db1115 is OK: snapshot for s4 at eqiad taken less than 3 days ago and 
larger than 90 GB: Last one 2020-03-29 23:43:31 from db1102.eqiad.wmnet:3314 (1139 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [03:47:10] PROBLEM - snapshot of s5 in eqiad on db1115 is CRITICAL: snapshot for s5 at eqiad taken more than 3 days ago: Most recent backup 2020-03-27 03:19:29 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [03:52:50] PROBLEM - snapshot of s2 in eqiad on db1115 is CRITICAL: snapshot for s2 at eqiad taken more than 3 days ago: Most recent backup 2020-03-27 03:25:16 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [04:16:16] PROBLEM - snapshot of s2 in codfw on db1115 is CRITICAL: snapshot for s2 at codfw taken more than 3 days ago: Most recent backup 2020-03-27 03:50:08 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [04:18:18] RECOVERY - snapshot of s5 in eqiad on db1115 is OK: snapshot for s5 at eqiad taken less than 3 days ago and larger than 90 GB: Last one 2020-03-30 03:20:17 from db1102.eqiad.wmnet:3315 (665 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [04:25:18] (03CR) 10Vgutierrez: [C: 03+2] ATS: Re-enable TLS tickets in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/583948 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [04:32:45] !log upgrade ATS to version 8.0.6-1wm4 on ulsfo - T245616 [04:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:32:52] T245616: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 [04:33:02] (03PS1) 10KartikMistry: apertium-bel-rus: Fix FTBFS with apertium 3.6 + 0.2.1 release [debs/contenttranslation/apertium-bel-rus] - 10https://gerrit.wikimedia.org/r/584282 (https://phabricator.wikimedia.org/T248812) [04:42:16] (03PS2) 10Vgutierrez: site: Reimage cp2027 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/583978 (https://phabricator.wikimedia.org/T247340) [04:45:49] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2027 as cache::text [puppet] - 
10https://gerrit.wikimedia.org/r/583978 (https://phabricator.wikimedia.org/T247340) (owner: 10Vgutierrez) [04:47:20] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: expand cp nodes regex [puppet] - 10https://gerrit.wikimedia.org/r/583635 (https://phabricator.wikimedia.org/T247340) (owner: 10BBlack) [04:49:55] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2027.codfw.wmnet ` The... [04:55:10] !log Enable TLS Session tickets in ulsfo - T245616 [04:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:16] T245616: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 [05:10:37] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [05:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:05] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:00] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2027.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2027.codfw.wmnet'] ` [05:40:24] !log pool cp2027 - T247340 [05:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:31] T247340: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 [05:48:22] (03PS1) 10Marostegui: db1107: Remove force 10.4 package [puppet] - 10https://gerrit.wikimedia.org/r/584289 [05:50:07] (03PS1) 10Vgutierrez: site: Reimage cp2028 as 
cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/584290 (https://phabricator.wikimedia.org/T247340) [05:50:52] (03CR) 10Marostegui: [C: 03+2] db1107: Remove force 10.4 package [puppet] - 10https://gerrit.wikimedia.org/r/584289 (owner: 10Marostegui) [05:51:35] PROBLEM - snapshot of s7 in eqiad on db1115 is CRITICAL: snapshot for s7 at eqiad taken more than 3 days ago: Most recent backup 2020-03-27 05:34:47 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [05:52:11] (03PS2) 10Muehlenhoff: Don't include component/thirdparty-k8s on Buster [puppet] - 10https://gerrit.wikimedia.org/r/583923 [06:00:09] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [06:02:58] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2028 as cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/584290 (https://phabricator.wikimedia.org/T247340) (owner: 10Vgutierrez) [06:03:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 for schema change', diff saved to https://phabricator.wikimedia.org/P10812 and previous config saved to /var/cache/conftool/dbconfig/20200330-060338-marostegui.json [06:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:07] !log Deploy schema change on db1074 with replication, this will generate lag on s2 labs [06:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:06] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on 
cumin1001.eqiad.wmnet for hosts: ` cp2028.codfw.wmnet ` The... [06:08:14] (03CR) 10Muehlenhoff: [C: 03+2] Don't include component/thirdparty-k8s on Buster [puppet] - 10https://gerrit.wikimedia.org/r/583923 (owner: 10Muehlenhoff) [06:11:39] (03Abandoned) 10Giuseppe Lavagetto: mwdebug: switch tls termination from nginx to envoy [puppet] - 10https://gerrit.wikimedia.org/r/576853 (owner: 10Giuseppe Lavagetto) [06:12:18] (03PS6) 10Giuseppe Lavagetto: Switch restbase to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578497 (https://phabricator.wikimedia.org/T244843) [06:15:19] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2001.codfw.wmnet - https://phabricator.wikimedia.org/T248815 (10Vgutierrez) [06:15:22] (03CR) 10Giuseppe Lavagetto: "> Argh, the reason the generate() is failing on perfectly good" [puppet] - 10https://gerrit.wikimedia.org/r/584020 (owner: 10RLazarus) [06:18:11] RECOVERY - snapshot of s2 in codfw on db1115 is OK: snapshot for s2 at codfw taken less than 3 days ago and larger than 90 GB: Last one 2020-03-30 03:09:17 from db2098.codfw.wmnet:3312 (821 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [06:20:41] (03PS1) 10Vgutierrez: site,install_server: Decommission cp2001 [puppet] - 10https://gerrit.wikimedia.org/r/584292 (https://phabricator.wikimedia.org/T248815) [06:22:04] (03CR) 10Muehlenhoff: [C: 03+2] Remove jessie support for puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/583953 (owner: 10Muehlenhoff) [06:26:49] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [06:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:08] PROBLEM - snapshot of s7 in codfw on db1115 is CRITICAL: snapshot for s7 at codfw taken more than 3 days ago: Most recent backup 2020-03-27 06:06:33 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [06:28:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1074 after schema change', diff saved to 
https://phabricator.wikimedia.org/P10813 and previous config saved to /var/cache/conftool/dbconfig/20200330-062858-marostegui.json [06:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:14] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:04] (03CR) 10Muehlenhoff: [C: 03+2] nagios: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/583931 (owner: 10Muehlenhoff) [06:31:32] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2028.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2028.codfw.wmnet'] ` [06:38:44] ACKNOWLEDGEMENT - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - commonswiki_content_1582512959(60.666666666666664gb) Elukey T246882 https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [06:46:49] (03CR) 10Muehlenhoff: [C: 03+2] postgres: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/581985 (owner: 10Muehlenhoff) [06:50:48] (03Abandoned) 10Muehlenhoff: service::node: Switch to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/566490 (owner: 10Muehlenhoff) [06:52:27] !log pool cp2028 - T247340 [06:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:33] T247340: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 [06:56:20] RECOVERY - snapshot of s2 in eqiad on db1115 is OK: snapshot for s2 at eqiad taken less than 3 days ago and larger than 90 GB: Last one 2020-03-30 03:26:39 from db1140.eqiad.wmnet:3312 (848 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [06:57:44] (03PS1) 10Muehlenhoff: puppetdb: 
Remove support for < buster [puppet] - https://gerrit.wikimedia.org/r/584382 [06:59:12] (PS2) Muehlenhoff: puppetdb: Remove support for < buster [puppet] - https://gerrit.wikimedia.org/r/584382 [07:08:42] Operations, Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez) [07:09:02] Operations, Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez) p:Triage→Medium [07:10:40] !log depool and decommission cp2001 - T248815 [07:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:45] T248815: decommission cp2001.codfw.wmnet - https://phabricator.wikimedia.org/T248815 [07:17:18] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [07:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:51] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [07:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:02] Operations, ops-codfw, DC-Ops, Traffic, and 2 others: decommission cp2001.codfw.wmnet - https://phabricator.wikimedia.org/T248815 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2001.codfw.wmnet` - cp2001.codfw.wmnet (**PASS**) - Downtimed h...
[07:26:39] !log Deploy schema change on s4 codfw, this will generate lag on codfw - T248333 [07:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:44] T248333: Schema change: Make page.page_restrictions column NULL - https://phabricator.wikimedia.org/T248333 [07:27:44] (03CR) 10Vgutierrez: [C: 03+2] site,install_server: Decommission cp2001 [puppet] - 10https://gerrit.wikimedia.org/r/584292 (https://phabricator.wikimedia.org/T248815) (owner: 10Vgutierrez) [07:27:59] (03PS2) 10Vgutierrez: site,install_server: Decommission cp2001 [puppet] - 10https://gerrit.wikimedia.org/r/584292 (https://phabricator.wikimedia.org/T248815) [07:28:42] !log Deploy schema change on labswiki (wikitech) - T248333 [07:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:21] (03PS1) 10Vgutierrez: Remove cp2001 DNS entries [dns] - 10https://gerrit.wikimedia.org/r/584530 (https://phabricator.wikimedia.org/T248815) [07:33:26] PROBLEM - snapshot of s6 in eqiad on db1115 is CRITICAL: snapshot for s6 at eqiad taken more than 3 days ago: Most recent backup 2020-03-27 07:11:46 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [07:35:02] (03CR) 10Vgutierrez: [C: 03+2] Remove cp2001 DNS entries [dns] - 10https://gerrit.wikimedia.org/r/584530 (https://phabricator.wikimedia.org/T248815) (owner: 10Vgutierrez) [07:37:21] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2001.codfw.wmnet - https://phabricator.wikimedia.org/T248815 (10Vgutierrez) a:05Vgutierrez→03Papaul [07:39:09] (03PS3) 10Marostegui: wmnet: Replace dbproxy1010 with dbproxy1018 [dns] - 10https://gerrit.wikimedia.org/r/534573 (https://phabricator.wikimedia.org/T231520) [07:39:27] (03PS3) 10Marostegui: wikireplica_dns: Replace dbproxy1010 with dbproxy1018 [puppet] - 10https://gerrit.wikimedia.org/r/534577 (https://phabricator.wikimedia.org/T231520) [07:40:17] !log Replace dbproxy1010 with dbproxy1011 for wiki replicas, analytics - T231520 
[07:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:22] T231520: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 [07:40:26] (03CR) 10Marostegui: [C: 03+2] wmnet: Replace dbproxy1010 with dbproxy1018 [dns] - 10https://gerrit.wikimedia.org/r/534573 (https://phabricator.wikimedia.org/T231520) (owner: 10Marostegui) [07:41:24] (03PS1) 10Vgutierrez: site: Reimage cp2029 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/584531 (https://phabricator.wikimedia.org/T248816) [07:41:39] (03CR) 10Marostegui: [C: 03+2] wikireplica_dns: Replace dbproxy1010 with dbproxy1018 [puppet] - 10https://gerrit.wikimedia.org/r/534577 (https://phabricator.wikimedia.org/T231520) (owner: 10Marostegui) [07:43:50] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2002.codfw.wmnet - https://phabricator.wikimedia.org/T248818 (10Vgutierrez) [07:47:15] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2029 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/584531 (https://phabricator.wikimedia.org/T248816) (owner: 10Vgutierrez) [07:48:15] !log Run cloudcontrol1003:~# wmcs-wikireplica-dns to promote dbproxy1018 to wikireplicas active proxy T231520 [07:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:21] T231520: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 [07:48:27] arturo: ^ [07:49:41] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2029.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... 
[07:50:58] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:58] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:51:45] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:52:07] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:53:01] RECOVERY - snapshot of s7 in eqiad on db1115 is OK: snapshot for s7 at eqiad taken less than 3 days ago and larger than 90 GB: Last one 2020-03-30 05:36:01 from db1116.eqiad.wmnet:3317 (933 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [07:53:28] (03PS1) 10Vgutierrez: site,install_server: Decommission cp2002 [puppet] - 10https://gerrit.wikimedia.org/r/584537 (https://phabricator.wikimedia.org/T248818) [07:53:44] !log depool & decommission cp2002 - T248818 [07:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:50] T248818: decommission cp2002.codfw.wmnet - https://phabricator.wikimedia.org/T248818 [07:54:33] PROBLEM - Confd template for /srv/config-master/pybal/esams/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:54:33] PROBLEM - Confd template for /srv/config-master/pybal/esams/text on puppetmaster2001 is CRITICAL: Compilation of file 
/srv/config-master/pybal/esams/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:54:44] uh? [07:56:53] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:56:53] PROBLEM - Confd template for /srv/config-master/pybal/esams/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:56:53] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:56:53] PROBLEM - Confd template for /srv/config-master/pybal/esams/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:57:02] nice... [07:58:59] RECOVERY - snapshot of s7 in codfw on db1115 is OK: snapshot for s7 at codfw taken less than 3 days ago and larger than 90 GB: Last one 2020-03-30 05:25:54 from db2100.codfw.wmnet:3317 (954 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:00:12] 10Operations, 10DBA, 10Data-Services, 10Patch-For-Review: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) @Bstorm @bd808 I have pushed https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/534577/3/modules/openstack/files/util... 
[08:02:17] (CR) Vgutierrez: [C: +2] site,install_server: Decommission cp2002 [puppet] - https://gerrit.wikimedia.org/r/584537 (https://phabricator.wikimedia.org/T248818) (owner: Vgutierrez) [08:02:25] (PS2) Vgutierrez: site,install_server: Decommission cp2002 [puppet] - https://gerrit.wikimedia.org/r/584537 (https://phabricator.wikimedia.org/T248818) [08:02:29] (PS1) Marostegui: dbproxy101: Clarify that it is not an active proxy [puppet] - https://gerrit.wikimedia.org/r/584539 (https://phabricator.wikimedia.org/T231520) [08:03:52] (PS4) Gergő Tisza: Deploy GrowthExperiments on Serbian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/584133 (https://phabricator.wikimedia.org/T241181) [08:03:54] (PS4) Gergő Tisza: Enable GrowthExperiments welcome survey on Ukrainian, Hungarian, Armenian Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/584135 (https://phabricator.wikimedia.org/T238295) [08:03:56] (PS3) Gergő Tisza: Enable GrowthExperiments on French Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) [08:03:58] (PS2) Gergő Tisza: Remove no-op GrowthExperiments beta settings [mediawiki-config] - https://gerrit.wikimedia.org/r/584184 [08:03:59] came on CI... [08:04:01] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:04:12] (CR) Marostegui: [C: +2] dbproxy101: Clarify that it is not an active proxy [puppet] - https://gerrit.wikimedia.org/r/584539 (https://phabricator.wikimedia.org/T231520) (owner: Marostegui) [08:04:30] marostegui: may I merge your change?
[08:04:36] yes please [08:04:43] I tried to run it but it was locked by you [08:05:10] done [08:05:26] thank you :* [08:05:51] the confd mess should have been fixed already [08:05:56] :9 [08:06:03] Operations, DBA, Data-Services, Patch-For-Review: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (Marostegui) a:Marostegui [08:06:33] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:08:47] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:08:49] RECOVERY - Confd template for /srv/config-master/pybal/esams/text-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:08:49] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:08:53] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:08:53] RECOVERY - Confd template for /srv/config-master/pybal/esams/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:08:55] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:08:57] RECOVERY - Confd template for /srv/config-master/pybal/esams/text on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:08:59] RECOVERY - Confd template for /srv/config-master/pybal/esams/text on puppetmaster2001 is OK: No
errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:09:55] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:10:01] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:10:26] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [08:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:31] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:10:31] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/text-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:12:04] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [08:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:40] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [08:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:44] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2002.codfw.wmnet - https://phabricator.wikimedia.org/T248818 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2002.codfw.wmnet` - cp2002.codfw.wmnet (**PASS**) - Downtim... 
[08:12:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:55] (PS1) Vgutierrez: Remove cp2002 entries [dns] - https://gerrit.wikimedia.org/r/584540 (https://phabricator.wikimedia.org/T248818) [08:15:19] Operations, Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2029.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2029.codfw.wmnet'] ` [08:16:14] Operations, Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez) [08:19:40] (PS1) Filippo Giunchedi: prometheus: break rules into logical groups [puppet] - https://gerrit.wikimedia.org/r/584541 [08:27:23] (PS1) Filippo Giunchedi: prometheus: add recording rules for disk activity [puppet] - https://gerrit.wikimedia.org/r/584542 [08:30:51] !log pool cp2029 - T248816 [08:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:57] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 [08:31:44] (CR) Vgutierrez: [C: +2] Remove cp2002 entries [dns] - https://gerrit.wikimedia.org/r/584540 (https://phabricator.wikimedia.org/T248818) (owner: Vgutierrez) [08:31:54] (CR) Jcrespo: "The unit tests seem way more complex than they should- either they should not be unit test and have less mocking or they should be simplif" (3 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: L0st3xpl0r3r) [08:33:33] Operations, ops-codfw, DC-Ops, Traffic, decommission: decommission cp2002.codfw.wmnet - https://phabricator.wikimedia.org/T248818 (Vgutierrez) a:Vgutierrez→Papaul [08:34:12] RECOVERY - snapshot of s6 in eqiad on db1115 is OK: snapshot for s6 at eqiad taken less
than 3 days ago and larger than 90 GB: Last one 2020-03-30 07:33:20 from db1139.eqiad.wmnet:3316 (519 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:38:38] (03PS1) 10Vgutierrez: site: Remove cp2029 from insetup_noferm role regex [puppet] - 10https://gerrit.wikimedia.org/r/584544 (https://phabricator.wikimedia.org/T248818) [08:39:27] 10Operations, 10Traffic: check_trafficserver_log_fifo: false positives when changing log format - https://phabricator.wikimedia.org/T248067 (10ema) During the weekend we had issues with ats-tls and the number of open file descriptors, see T248736. Due to this, check_trafficserver_log_fifo started timing out.... [08:39:39] (03CR) 10Vgutierrez: [C: 03+2] site: Remove cp2029 from insetup_noferm role regex [puppet] - 10https://gerrit.wikimedia.org/r/584544 (https://phabricator.wikimedia.org/T248818) (owner: 10Vgutierrez) [08:40:00] PROBLEM - snapshot of x1 in eqiad on db1115 is CRITICAL: snapshot for x1 at eqiad taken more than 3 days ago: Most recent backup 2020-03-27 08:18:24 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:43:17] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10ema) 05Resolved→03Open The patch disabling transaction_active_timeout was reverted by @Vgutierrez (th... 
[08:44:43] (03PS1) 10Filippo Giunchedi: prometheus: add http redirects for the default instance [puppet] - 10https://gerrit.wikimedia.org/r/584545 [08:45:55] (03PS1) 10Vgutierrez: site: Reimage cp2030 as cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/584546 (https://phabricator.wikimedia.org/T248818) [08:52:00] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2030 as cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/584546 (https://phabricator.wikimedia.org/T248818) (owner: 10Vgutierrez) [08:54:25] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2030.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202003300854_vgutierr... [08:54:56] PROBLEM - snapshot of x1 in codfw on db1115 is CRITICAL: snapshot for x1 at codfw taken more than 3 days ago: Most recent backup 2020-03-27 08:41:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:57:09] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (10Vgutierrez) [08:57:34] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (10Vgutierrez) a:03Vgutierrez [08:58:42] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [08:59:31] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [09:00:05] (03CR) 10Jcrespo: "Functionality-wise, it passes CI, and if the patch is not applied, the 3 new tests fail, so works as intended:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [09:05:04] PROBLEM - snapshot of s3 in eqiad on db1115 is CRITICAL: snapshot for s3 at 
eqiad taken more than 3 days ago: Most recent backup 2020-03-27 08:37:39 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:05:18] (03PS1) 10Alexandros Kosiaris: Update to 5.0.41 templates [software/otrs] - 10https://gerrit.wikimedia.org/r/584549 [09:15:07] ACKNOWLEDGEMENT - snapshot of s3 in eqiad on db1115 is CRITICAL: snapshot for s3 at eqiad taken more than 3 days ago: Most recent backup 2020-03-27 08:37:39 Jcrespo ongoing backups, will need patch to avoid spam https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:15:07] ACKNOWLEDGEMENT - snapshot of x1 in codfw on db1115 is CRITICAL: snapshot for x1 at codfw taken more than 3 days ago: Most recent backup 2020-03-27 08:41:06 Jcrespo ongoing backups, will need patch to avoid spam https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:15:07] ACKNOWLEDGEMENT - snapshot of x1 in eqiad on db1115 is CRITICAL: snapshot for x1 at eqiad taken more than 3 days ago: Most recent backup 2020-03-27 08:18:24 Jcrespo ongoing backups, will need patch to avoid spam https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:15:11] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [09:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:46] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:03] (03PS1) 10Jcrespo: mariadb-backups: Move friday backup to Saturday to avoid 3-day old ones [puppet] - 10https://gerrit.wikimedia.org/r/584552 [09:20:20] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2030.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2030.codfw.wmnet'] ` [09:21:59] (03PS1) 10Ema: systemd: add support for network accounting [puppet] - 10https://gerrit.wikimedia.org/r/584553 (https://phabricator.wikimedia.org/T183146) [09:26:02] 
RECOVERY - snapshot of x1 in codfw on db1115 is OK: snapshot for x1 at codfw taken less than 3 days ago and larger than 90 GB: Last one 2020-03-30 08:07:27 from db2101.codfw.wmnet:3320 (188 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:26:38] PROBLEM - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 3 days ago: Most recent backup 2020-03-27 09:09:41 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:27:39] (03CR) 10Ema: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1001/21618/" [puppet] - 10https://gerrit.wikimedia.org/r/584553 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [09:30:04] hoo: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200330T0930). [09:31:51] 10Operations, 10Cloud-VPS (Project-requests), 10cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (10Volans) The idea sounds good to me and surely help to simplify one main use case. One possible addition could be to have a single long-liv... 
[09:36:44] (03CR) 10Marostegui: [C: 03+1] mariadb-backups: Move friday backup to Saturday to avoid 3-day old ones [puppet] - 10https://gerrit.wikimedia.org/r/584552 (owner: 10Jcrespo) [09:41:16] (03PS3) 10Ema: varnish: Remove duplicate 'Content-Type: text/html' statement [puppet] - 10https://gerrit.wikimedia.org/r/558752 (owner: 10Krinkle) [09:41:54] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Move friday backup to Saturday to avoid 3-day old ones [puppet] - 10https://gerrit.wikimedia.org/r/584552 (owner: 10Jcrespo) [09:42:12] RECOVERY - snapshot of x1 in eqiad on db1115 is OK: snapshot for x1 at eqiad taken less than 3 days ago and larger than 90 GB: Last one 2020-03-30 08:44:03 from db1140.eqiad.wmnet:3320 (160 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:43:33] (03PS1) 10Alexandros Kosiaris: otrs: Add 2 new CPAN dependencies [puppet] - 10https://gerrit.wikimedia.org/r/584557 (https://phabricator.wikimedia.org/T248814) [09:43:47] ACKNOWLEDGEMENT - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 3 days ago: Most recent backup 2020-03-27 09:09:41 Jcrespo backups running, https://gerrit.wikimedia.org/r/c/operations/puppet/+/584552 should solve it next week https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:48:13] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (10Jpita) >>! In T247722#6008461, @Aklapper wrote: > @Jpita: On wikitech, see the previous comment. (If your question was about logging into the `Jose pita` account.) h... [09:49:12] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/21617/" [puppet] - 10https://gerrit.wikimedia.org/r/584545 (owner: 10Filippo Giunchedi) [09:52:30] (03CR) 10Ema: [C: 03+2] "I've updated 21-sec-warning.vtc to test that Content-Type is set to the expected value. 
Other than that the patch looks good and tests are" [puppet] - 10https://gerrit.wikimedia.org/r/558752 (owner: 10Krinkle) [09:56:29] !log hoo@deploy1001 Synchronized php-1.35.0-wmf.25/extensions/Wikibase/repo/maintenance/DumpEntities.php: DumpEntities: Fix DB group default override (T248612) (duration: 01m 02s) [09:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:35] T248612: Wikibase dump scripts fail as they can't find 'Wikibase\Lib\WikibaseSettings' - https://phabricator.wikimedia.org/T248612 [09:59:26] !log Temporarily modified dumpsgen's crontab on snapshot1008 so that the Wikidata JSON dumps start at 9:59 UTC today (T248612) [09:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "The change looks correct to me, but I think we should consider making the upstream address overridable by service as a followup." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/583367 (owner: 10Jbond) [10:02:17] (03PS1) 10Filippo Giunchedi: install_server: remove obsolete custom recipes [puppet] - 10https://gerrit.wikimedia.org/r/584559 [10:03:34] (03CR) 10jerkins-bot: [V: 04-1] install_server: remove obsolete custom recipes [puppet] - 10https://gerrit.wikimedia.org/r/584559 (owner: 10Filippo Giunchedi) [10:07:36] (03PS1) 10Ema: varnish: remove X-Webauth-User VTC leftovers [puppet] - 10https://gerrit.wikimedia.org/r/584561 (https://phabricator.wikimedia.org/T246508) [10:11:45] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Verified-1" [puppet] - 10https://gerrit.wikimedia.org/r/584559 (owner: 10Filippo Giunchedi) [10:14:58] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "systemd-analyze is present in stretch as well, we just need to install systemd https://gerrit.wikimedia.org/r/#/c/integration/config/+/584" [puppet] - 10https://gerrit.wikimedia.org/r/584020 (owner: 10RLazarus) [10:16:29] (03CR) 10Dzahn: [C: 03+2] add IPv6 for miscweb2002 [dns] - 
10https://gerrit.wikimedia.org/r/583980 (owner: 10Dzahn) [10:16:58] (03PS1) 10Ema: varnish: use buster box for VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/584563 [10:19:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: introduce role/profile for legacy URL redirector [puppet] - 10https://gerrit.wikimedia.org/r/583593 (https://phabricator.wikimedia.org/T247236) (owner: 10Arturo Borrero Gonzalez) [10:20:12] (03CR) 10Dzahn: [C: 03+2] add miscweb2002.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/583970 (https://phabricator.wikimedia.org/T247887) (owner: 10Dzahn) [10:20:17] (03PS2) 10Dzahn: add miscweb2002.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/583970 (https://phabricator.wikimedia.org/T247887) [10:22:56] (03PS2) 10Dzahn: add IPv6 for miscweb2002 [dns] - 10https://gerrit.wikimedia.org/r/583980 [10:24:21] (03PS9) 10Jbond: profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - 10https://gerrit.wikimedia.org/r/583367 [10:24:36] (03CR) 10Jbond: "thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/583367 (owner: 10Jbond) [10:25:21] (03CR) 10Jbond: [C: 03+2] envoy: introduce use_remote_address parameter [puppet] - 10https://gerrit.wikimedia.org/r/583366 (owner: 10Jbond) [10:25:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: remove the old docker builder code [puppet] - 10https://gerrit.wikimedia.org/r/584059 (https://phabricator.wikimedia.org/T248703) (owner: 10Bstorm) [10:30:04] jan_drewniak: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200330T1030). 
[10:32:12] (03PS10) 10Jbond: profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - 10https://gerrit.wikimedia.org/r/583367 [10:32:14] (03PS4) 10Jbond: idp: update the idp proxy config to use localhost and use_remote_address [puppet] - 10https://gerrit.wikimedia.org/r/583368 [10:32:47] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/583367 (owner: 10Jbond) [10:33:52] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [10:33:52] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [10:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:16] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [10:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/584382 (owner: 10Muehlenhoff) [10:43:23] 10Operations, 10Patch-For-Review, 10Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (10Joe) @RLazarus this is done, right? Is there anything left to do? 
[10:43:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [10:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:19] (03PS1) 10Dzahn: DHCP: add miscweb2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/584566 (https://phabricator.wikimedia.org/T247887) [10:47:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We still need to monitor nutcracker with high frequency, but that might go away as soon as risky deployments are resumed (meaning we'll fi" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [10:48:07] (03CR) 10Dzahn: [C: 03+2] DHCP: add miscweb2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/584566 (https://phabricator.wikimedia.org/T247887) (owner: 10Dzahn) [10:50:07] (03PS2) 10Dzahn: scap: add codfw canary appservers to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/574902 (https://phabricator.wikimedia.org/T242606) [10:53:23] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/584553 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [10:54:40] (03CR) 10Jbond: [C: 03+2] idp: update the idp proxy config to use localhost and use_remote_address [puppet] - 10https://gerrit.wikimedia.org/r/583368 (owner: 10Jbond) [10:54:44] (03CR) 10Jbond: [C: 03+2] profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - 10https://gerrit.wikimedia.org/r/583367 (owner: 10Jbond) [10:57:31] (03PS3) 10Dzahn: scap: add codfw canary appservers to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/574902 (https://phabricator.wikimedia.org/T242606) [10:58:36] (03PS12) 10L0st3xpl0r3r: transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European Mid-day SWAT(Max 6 
patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200330T1100). [11:00:04] RhinosF1 and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] I can SWAT today! [11:00:24] hi [11:00:29] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (10Aklapper) @Jpita: See the question in T247722#6004734 which is not about `Jpita` but about `Jose pita`. [11:00:44] (03CR) 10Dzahn: [C: 03+2] scap: add codfw canary appservers to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/574902 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [11:00:50] (03CR) 10Urbanecm: [C: 03+2] "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584172 (https://phabricator.wikimedia.org/T248734) (owner: 10RhinosF1) [11:01:34] RhinosF1: i'll ping you when it's ready for testing [11:01:40] thx [11:01:48] (03Merged) 10jenkins-bot: Add 3 additional namespaces and assoicated talk pages to trwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584172 (https://phabricator.wikimedia.org/T248734) (owner: 10RhinosF1) [11:01:56] (03PS13) 10L0st3xpl0r3r: transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) [11:02:57] RhinosF1: pulled onto mwdebug1001 [11:03:54] (03CR) 10Urbanecm: [C: 03+2] Add collections.nmnh.si.edu to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584126 (https://phabricator.wikimedia.org/T248659) (owner: 10Urbanecm) [11:03:59] (03PS14) 10L0st3xpl0r3r: transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 
10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) [11:04:10] (03PS1) 10Jbond: tlsproxy::envoy: fix rspec and TODO comments [puppet] - 10https://gerrit.wikimedia.org/r/584569 [11:04:28] Urbanecm: LGTM! [11:04:35] syncing [11:05:37] (03PS3) 10Urbanecm: Add collections.nmnh.si.edu to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584126 (https://phabricator.wikimedia.org/T248659) [11:05:44] (03CR) 10Urbanecm: [C: 03+2] Add collections.nmnh.si.edu to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584126 (https://phabricator.wikimedia.org/T248659) (owner: 10Urbanecm) [11:06:02] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: c8c06f9: Add 3 additional namespaces and assoicated talk pages to trwiktionary (T248734) (duration: 00m 59s) [11:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:09] T248734: Adding some equivalent namespaces of English Wiktionary to Turkish Wiktionary - https://phabricator.wikimedia.org/T248734 [11:06:54] (03Merged) 10jenkins-bot: Add collections.nmnh.si.edu to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584126 (https://phabricator.wikimedia.org/T248659) (owner: 10Urbanecm) [11:07:01] Thanks Urbanecm [11:07:28] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: c8c06f9: Add 3 additional namespaces and assoicated talk pages to trwiktionary (T248734; take II) (duration: 00m 59s) [11:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:50] RhinosF1: done [11:07:51] (03PS1) 10Dzahn: site: add miscweb2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/584570 (https://phabricator.wikimedia.org/T247648) [11:08:01] Seen in live [11:08:06] (03CR) 10Jbond: [C: 03+2] tlsproxy::envoy: fix rspec and TODO comments [puppet] - 10https://gerrit.wikimedia.org/r/584569 (owner: 10Jbond) [11:08:09] (03CR) 
10L0st3xpl0r3r: "> Patch Set 11:" (033 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [11:08:23] !log pool cp2030 - T248816 [11:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:28] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 [11:08:44] (03CR) 10Dzahn: [C: 03+2] site: add miscweb2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/584570 (https://phabricator.wikimedia.org/T247648) (owner: 10Dzahn) [11:09:02] RECOVERY - snapshot of s3 in eqiad on db1115 is OK: snapshot for s3 at eqiad taken less than 3 days ago and larger than 90 GB: Last one 2020-03-30 09:06:09 from db1095.eqiad.wmnet:3313 (836 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [11:09:23] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: ac7e625: Add collections.nmnh.si.edu to $wgCopyUploadsDomains (T248659) (duration: 00m 58s) [11:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:28] T248659: Add collections.nmnh.si.edu to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T248659 [11:09:38] (03PS2) 10Dzahn: site: add miscweb2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/584570 (https://phabricator.wikimedia.org/T247648) [11:10:26] (03PS6) 10Filippo Giunchedi: icinga: relax check interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) [11:10:55] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: relax check interval for selected checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [11:10:57] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: ac7e625: Add collections.nmnh.si.edu to $wgCopyUploadsDomains (T248659; take II) (duration: 00m 
58s) [11:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:05] !log EU SWAT done [11:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:20] (03PS1) 10Dzahn: Revert "scap: add codfw canary appservers to dsh group" [puppet] - 10https://gerrit.wikimedia.org/r/584571 [11:12:38] (03PS1) 10Vgutierrez: cache: Fix profile::cache::base::default_weights [puppet] - 10https://gerrit.wikimedia.org/r/584572 [11:13:20] (03CR) 10Urbanecm: [C: 04-1] "I think you should add the wiki to growthexperiments dblist as well" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) (owner: 10Gergő Tisza) [11:13:37] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584135 (https://phabricator.wikimedia.org/T238295) (owner: 10Gergő Tisza) [11:14:00] (03CR) 10Dzahn: [C: 03+2] Revert "scap: add codfw canary appservers to dsh group" [puppet] - 10https://gerrit.wikimedia.org/r/584571 (owner: 10Dzahn) [11:14:18] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584133 (https://phabricator.wikimedia.org/T241181) (owner: 10Gergő Tisza) [11:19:32] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:22:48] !log delete ARIN allocations from RIPE's IRR - T235886 [11:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:53] T235886: IRR updates needed - https://phabricator.wikimedia.org/T235886 [11:23:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:24:42] (03PS1) 10KartikMistry: Enable ContentTranslation in Lithuanian Wikipedia as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584574 (https://phabricator.wikimedia.org/T248179) [11:25:35] (03CR) 10Jcrespo: ";-)" (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [11:26:12] (03PS1) 10Arturo Borrero Gonzalez: toolforge: legacy_redirector: fix toolsbeta support [puppet] - 10https://gerrit.wikimedia.org/r/584575 (https://phabricator.wikimedia.org/T247236) [11:26:52] (03CR) 10Ema: [C: 03+1] cache: Fix profile::cache::base::default_weights [puppet] - 10https://gerrit.wikimedia.org/r/584572 (owner: 10Vgutierrez) [11:27:30] (03CR) 10Ema: [C: 03+2] varnish: remove X-Webauth-User VTC leftovers [puppet] - 10https://gerrit.wikimedia.org/r/584561 (https://phabricator.wikimedia.org/T246508) (owner: 10Ema) [11:27:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:28:03] (03CR) 10Ema: [C: 03+2] varnish: use buster box for VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/584563 (owner: 10Ema) [11:29:36] (03PS2) 10Arturo Borrero Gonzalez: toolforge: legacy_redirector: fix toolsbeta support [puppet] - 10https://gerrit.wikimedia.org/r/584575 (https://phabricator.wikimedia.org/T247236) [11:30:07] (03CR) 10Jcrespo: "Tip: You can run "pycodestyle --max-line-length=100 transfer.py" to check those warnings locally before submitting:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [11:30:53] !log Deploy schema change on dbstore1004:3314 [11:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:32:48] (03CR) 10Vgutierrez: [C: 03+2] cache: Fix profile::cache::base::default_weights [puppet] - 10https://gerrit.wikimedia.org/r/584572 (owner: 10Vgutierrez) [11:35:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: legacy_redirector: fix toolsbeta support [puppet] - 10https://gerrit.wikimedia.org/r/584575 (https://phabricator.wikimedia.org/T247236) (owner: 10Arturo Borrero Gonzalez) [11:37:30] !log miscweb2002 - installed OS, added to puppet, added role and ... 
sed -i 's/tin.eqiad/deployment.eqiad/g' /srv/deployment/iegreview/iegreview-cache/.config (T247648) [11:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:36] T247648: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 [11:40:29] (03PS5) 10Ema: ATS: unset debug HTTP headers for normal requests [puppet] - 10https://gerrit.wikimedia.org/r/583570 (https://phabricator.wikimedia.org/T210484) [11:42:12] !log miscweb2002 - race condition with apache2 mpm and php7.3 module met - a2dismod mpm_event ; systemctl restart apache2 ; puppet agent -tv (also see T196968, https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206) T247887 [11:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:18] T196968: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 [11:42:19] T247887: Site: eqiad/codfw 2 VM request for miscweb - https://phabricator.wikimedia.org/T247887 [11:43:33] (03PS5) 10Gergő Tisza: Deploy GrowthExperiments on Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584133 (https://phabricator.wikimedia.org/T241181) [11:43:35] (03PS5) 10Gergő Tisza: Enable GrowthExperiments welcome survey on Ukrainian, Hungarian, Armenian Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584135 (https://phabricator.wikimedia.org/T238295) [11:43:37] (03PS4) 10Gergő Tisza: Enable GrowthExperiments on French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) [11:43:39] (03PS3) 10Gergő Tisza: Remove no-op GrowthExperiments beta settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584184 [11:43:41] (03PS1) 10Gergő Tisza: Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584579 [11:44:06] (03CR) 10Gergő Tisza: "Oops, totally forgot about that. 
Fixed here and in the srwiki patch, and fixed the current disparity in I577b10706." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) (owner: 10Gergő Tisza) [11:45:09] (03CR) 10jerkins-bot: [V: 04-1] Deploy GrowthExperiments on Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584133 (https://phabricator.wikimedia.org/T241181) (owner: 10Gergő Tisza) [11:45:34] (03CR) 10jerkins-bot: [V: 04-1] Enable GrowthExperiments welcome survey on Ukrainian, Hungarian, Armenian Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584135 (https://phabricator.wikimedia.org/T238295) (owner: 10Gergő Tisza) [11:45:57] (03CR) 10jerkins-bot: [V: 04-1] Enable GrowthExperiments on French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) (owner: 10Gergő Tisza) [11:46:38] (03CR) 10jerkins-bot: [V: 04-1] Remove no-op GrowthExperiments beta settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584184 (owner: 10Gergő Tisza) [11:46:57] (03CR) 10CDanis: [C: 03+1] prometheus: add http redirects for the default instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/584545 (owner: 10Filippo Giunchedi) [11:46:59] (03CR) 10jerkins-bot: [V: 04-1] Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584579 (owner: 10Gergő Tisza) [11:47:22] (03PS3) 10Gergő Tisza: Alphabetize GrowthExperiments settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584132 [11:47:24] (03PS2) 10Gergő Tisza: Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584579 (https://phabricator.wikimedia.org/T248844) [11:47:28] (03PS6) 10Gergő Tisza: Deploy GrowthExperiments on Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584133 (https://phabricator.wikimedia.org/T241181) 
[11:47:30] (03PS6) 10Gergő Tisza: Enable GrowthExperiments welcome survey on Ukrainian, Hungarian, Armenian Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584135 (https://phabricator.wikimedia.org/T238295) [11:47:32] (03PS5) 10Gergő Tisza: Enable GrowthExperiments on French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) [11:47:34] (03PS4) 10Gergő Tisza: Remove no-op GrowthExperiments beta settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584184 [11:48:40] (03CR) 10jerkins-bot: [V: 04-1] Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584579 (https://phabricator.wikimedia.org/T248844) (owner: 10Gergő Tisza) [11:49:07] (03CR) 10jerkins-bot: [V: 04-1] Enable GrowthExperiments welcome survey on Ukrainian, Hungarian, Armenian Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584135 (https://phabricator.wikimedia.org/T238295) (owner: 10Gergő Tisza) [11:49:23] (03CR) 10jerkins-bot: [V: 04-1] Deploy GrowthExperiments on Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584133 (https://phabricator.wikimedia.org/T241181) (owner: 10Gergő Tisza) [11:49:52] (03CR) 10jerkins-bot: [V: 04-1] Enable GrowthExperiments on French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) (owner: 10Gergő Tisza) [11:50:35] (03CR) 10jerkins-bot: [V: 04-1] Remove no-op GrowthExperiments beta settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584184 (owner: 10Gergő Tisza) [11:50:41] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:52:37] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: 
HTTP/1.0 200 OK - 22376 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:52:41] (03CR) 10CDanis: [C: 03+1] prometheus: break rules into logical groups [puppet] - 10https://gerrit.wikimedia.org/r/584541 (owner: 10Filippo Giunchedi) [11:53:32] (03CR) 10CDanis: prometheus: add recording rules for disk activity (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/584542 (owner: 10Filippo Giunchedi) [11:55:57] 10Operations, 10serviceops, 10Patch-For-Review: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) [11:55:59] 10Operations, 10vm-requests, 10Patch-For-Review: Site: eqiad/codfw 2 VM request for miscweb - https://phabricator.wikimedia.org/T247887 (10Dzahn) [11:56:07] 10Operations, 10vm-requests, 10Patch-For-Review: Site: eqiad/codfw 2 VM request for miscweb - https://phabricator.wikimedia.org/T247887 (10Dzahn) 05Open→03Resolved a:03Dzahn [11:56:08] 10Operations, 10serviceops, 10Patch-For-Review: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) [11:56:20] 10Operations, 10vm-requests: Site: eqiad/codfw 2 VM request for miscweb - https://phabricator.wikimedia.org/T247887 (10Dzahn) [11:56:45] (03PS1) 10Dzahn: ATS: switch scholarships.wm.org to miscweb1002 backend [puppet] - 10https://gerrit.wikimedia.org/r/584581 (https://phabricator.wikimedia.org/T247648) [11:58:27] (03CR) 10Dzahn: [C: 03+2] "SANs have been added to the cert, mysql access confirmed working, App is not used currently anyways." 
[puppet] - 10https://gerrit.wikimedia.org/r/584581 (https://phabricator.wikimedia.org/T247648) (owner: 10Dzahn) [11:58:47] (03PS2) 10Dzahn: ATS: switch scholarships.wm.org to miscweb1002 backend [puppet] - 10https://gerrit.wikimedia.org/r/584581 (https://phabricator.wikimedia.org/T247648) [11:59:49] !log delete unused ROAs for RIPE prefixes - T235886 [11:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:55] T235886: IRR updates needed - https://phabricator.wikimedia.org/T235886 [12:03:22] !log delete unused ROA for ARIN v6 prefixes - T235886 [12:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:37] PROBLEM - traffic_server backend process restarted on cp2013 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2013&var-layer=backend [12:16:07] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 (10Vgutierrez) [12:16:19] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [12:16:23] (03PS1) 10Arturo Borrero Gonzalez: toolforge: legacy redirector: fix/improve table of allowed tools [puppet] - 10https://gerrit.wikimedia.org/r/584582 (https://phabricator.wikimedia.org/T247236) [12:17:42] (03PS1) 10Dzahn: ATS: switch iegreview.wm.org to miscweb1002 backend [puppet] - 10https://gerrit.wikimedia.org/r/584583 (https://phabricator.wikimedia.org/T247648) [12:18:38] PROBLEM - Check systemd state on boron is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:38] 10Operations, 10netops: IRR updates needed - https://phabricator.wikimedia.org/T235886 (10ayounsi) The last unused ARIN ROA was bundled with the one used for ulsfo. I created a new dedicated one for ulsfo and will delete the old one tomorrow. I don't think it's possible to delete the arin-whois as they match... [12:20:01] (03CR) 10Muehlenhoff: [C: 03+1] ATS: switch iegreview.wm.org to miscweb1002 backend [puppet] - 10https://gerrit.wikimedia.org/r/584583 (https://phabricator.wikimedia.org/T247648) (owner: 10Dzahn) [12:20:15] James_F: do we need to run composer buildDBLists before committing any change to dblist yaml files these days? [12:20:20] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:20:44] (03PS1) 10Vgutierrez: site,install_server: Decommission cp2004 [puppet] - 10https://gerrit.wikimedia.org/r/584584 (https://phabricator.wikimedia.org/T248824) [12:20:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: legacy redirector: fix/improve table of allowed tools [puppet] - 10https://gerrit.wikimedia.org/r/584582 (https://phabricator.wikimedia.org/T247236) (owner: 10Arturo Borrero Gonzalez) [12:20:51] tgr: you do, I wonder if James_F ever copied the guide from my user-space [12:20:55] might be worth a mention in the readme file, or a separate readme file in /dblists or something like that [12:21:28] !log depool & decommission cp2004 - T248824 [12:21:29] currently it seems you can only reverse-engineer it from the CI errors [12:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:33] T248824: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 [12:23:36] PROBLEM - Check that envoy is running on idp2001 is CRITICAL: CRITICAL - Expecting active but unit 
envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [12:23:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [12:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/584553 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [12:24:18] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [12:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:28] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2004.codfw.wmnet` - cp2004.codfw.wmnet (**PASS**) -... [12:24:44] RECOVERY - Check that envoy is running on idp2001 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [12:25:08] (03CR) 10Vgutierrez: [C: 03+2] site,install_server: Decommission cp2004 [puppet] - 10https://gerrit.wikimedia.org/r/584584 (https://phabricator.wikimedia.org/T248824) (owner: 10Vgutierrez) [12:26:14] !log cdanis@re0.cr2-codfw# set chassis fpc 5 inline-services flex-flow-sizing cdanis@re0.cr2-codfw# commit comment "flex-flow-sizing T248394" [12:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:20] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 [12:26:58] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, but adding Ariel and Jaime (last one to touch aid1-lvm-ext4-srv-plus-hwraid.cfg)" [puppet] - 10https://gerrit.wikimedia.org/r/584559 (owner: 10Filippo Giunchedi) [12:27:02] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on 
cp3054 is OK: HTTP OK: HTTP/1.0 200 OK - 22372 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:28:15] (03PS1) 10Vgutierrez: Remove cp2004 entries [dns] - 10https://gerrit.wikimedia.org/r/584586 (https://phabricator.wikimedia.org/T248824) [12:30:35] vgutierrez: could I do the same with cp1099? [12:30:53] 10Operations, 10netops, 10Patch-For-Review: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (10CDanis) [12:30:54] RECOVERY - snapshot of s3 in codfw on db1115 is OK: snapshot for s3 at codfw taken less than 3 days ago and larger than 90 GB: Last one 2020-03-30 08:54:05 from db2098.codfw.wmnet:3313 (838 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [12:30:56] (03CR) 10Vgutierrez: [C: 03+2] Remove cp2004 entries [dns] - 10https://gerrit.wikimedia.org/r/584586 (https://phabricator.wikimedia.org/T248824) (owner: 10Vgutierrez) [12:31:27] mutante: decomm it? [12:31:31] mutante: or clean the DNS entries? [12:32:09] vgutierrez: both. afaict the status is "shut down" but the decom script has not been run [12:32:12] PROBLEM - traffic_server backend process restarted on cp2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2010&var-layer=backend [12:32:15] and i have clean up patches [12:32:35] mutante: it was being actually used as a cp server? [12:32:38] "ticket says erroneously it is already [12:32:39] removed from puppet repo and DNS which it is not" [12:33:05] vgutierrez: yes, it was part of https://phabricator.wikimedia.org/T229586 [12:33:18] mutante: I'm kinda paranoid with those and before running the decomm cookbook I wipe /etc/acmecerts && /etc/ssl/private/ [12:33:36] otherwise, go for it [12:33:53] vgutierrez: using "shred" ? 
[12:34:09] I'm using srm provided by the secure-delete debian package [12:34:53] (03PS1) 10Muehlenhoff: cumin: Fix Python version for Buster and remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/584587 [12:35:04] ah! ok. maybe we should add it to base packages then. shred is available without install fwiw [12:35:11] vgutierrez: ok, thanks, i'll go ahead then [12:35:14] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [12:35:15] thx [12:36:04] cp1099 was initially meant to replace cp1008, but eventually pinkunicorn was discarded completely [12:36:20] that is why it has the weird status i guess [12:36:48] it is down. so would boot it just to delete the files above [12:37:00] and then shut it down for real with the cookbook [12:38:10] there's no need to delete these files, they are all wiped anyway? [12:38:14] (03PS1) 10Vgutierrez: site: Reimage cp2031 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/584588 (https://phabricator.wikimedia.org/T248816) [12:38:58] and these days disks are also shredded, which adds a second safety net [12:39:14] (physically when out of service) [12:39:31] yeah.. I'm aware i'm being paranoid :) [12:39:45] but our TLS unified material is kinda a big deal :) [12:40:45] and these days with COVID-19 restrictions in our lovely DCops team, I don't know how many days could pass with the system decomm'ed till it's actually removed and the disk wiped [12:40:59] hmm.. i can't connect to mgmt on it [12:41:12] weird lifecycle state [12:42:00] yea.. 
so the ticket claims all checkboxes are done [12:42:05] but it's also in site and DNS [12:42:12] fine with me :-) [12:42:40] (03CR) 10Muehlenhoff: [C: 03+2] puppetdb: Remove support for < buster [puppet] - 10https://gerrit.wikimedia.org/r/584382 (owner: 10Muehlenhoff) [12:42:59] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (10Vgutierrez) a:05Vgutierrez→03Papaul [12:43:35] vgutierrez: i'm afraid it's a cycle now. can't actually delete the files unless dc-ops brings it back up. or i can clean up puppet and DNS and that's it [12:44:09] nah, clean the DNS entries [12:44:18] ok [12:44:29] (03PS3) 10Gergő Tisza: Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584579 (https://phabricator.wikimedia.org/T248844) [12:44:31] (03PS7) 10Gergő Tisza: Deploy GrowthExperiments on Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584133 (https://phabricator.wikimedia.org/T241181) [12:44:33] (03PS7) 10Gergő Tisza: Enable GrowthExperiments welcome survey on Ukrainian, Hungarian, Armenian Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584135 (https://phabricator.wikimedia.org/T238295) [12:44:35] (03PS6) 10Gergő Tisza: Enable GrowthExperiments on French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) [12:44:37] (03PS5) 10Gergő Tisza: Remove no-op GrowthExperiments beta settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584184 [12:44:46] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2031 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/584588 (https://phabricator.wikimedia.org/T248816) (owner: 10Vgutierrez) [12:45:04] (03CR) 10Dzahn: [C: 03+2] site/DHCP: remove cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/580087 (https://phabricator.wikimedia.org/T229586) (owner: 10Dzahn) [12:45:21] mutante: 
don't let my paranoia poison you ;P [12:46:48] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2031.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... [12:46:54] vgutierrez: heh, it's usually a good thing but this is too late [12:47:35] (03CR) 10Volans: [C: 03+1] "LGTM. Just FYI the cumin build for buster is a bit more work than I expected, I think I'd go for releasing the 4.0.0 with buster support, " [puppet] - 10https://gerrit.wikimedia.org/r/584587 (owner: 10Muehlenhoff) [12:47:37] actually.. it wouldn't be a bad idea to use encryption at rest for /etc [12:48:20] but of course.. easier said than done :) [12:49:26] yea, manually creating a LUKS volume in the installer is easier than writing partman recipes.. if they could do it [12:50:28] (03PS1) 10Vgutierrez: site,install_server: Decommission cp2005 [puppet] - 10https://gerrit.wikimedia.org/r/584590 (https://phabricator.wikimedia.org/T248848) [12:51:04] mutante: yup, but it would require human interaction every time a server gets restarted [12:51:27] yea.. 
[12:51:50] kind of like "keyholder arm" is needed on deployment servers [12:51:57] indeed [12:52:38] (03CR) 10Volans: [C: 03+1] "I didn't check if there is any jessie cumin master in WMCS though, I hope not :)" [puppet] - 10https://gerrit.wikimedia.org/r/584587 (owner: 10Muehlenhoff) [12:53:09] moritzm: but that's not true ^^^ ;) [12:53:17] af-puppetmaster02 is jessie [12:53:28] !log depool & decommission cp2005 - T248848 [12:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:34] T248848: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 [12:53:35] (03PS3) 10Dzahn: site/DHCP: remove cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/580087 (https://phabricator.wikimedia.org/T229586) [12:53:42] that task for af-puppetmaster has been pinged for months, is that still not done? [12:54:20] (03CR) 10Dzahn: [C: 03+2] "per IRC" [puppet] - 10https://gerrit.wikimedia.org/r/580087 (https://phabricator.wikimedia.org/T229586) (owner: 10Dzahn) [12:54:40] I just login, so yeah I'd say not yet. I think the last update was that should be done in the next couple of weeks [12:55:29] is puppet itself working on that old puppetmaster? 
[12:55:39] yes [12:55:50] AFAICT, but didn't look in depth [12:56:00] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [12:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:20] (03CR) 10Vgutierrez: [C: 03+2] site,install_server: Decommission cp2005 [puppet] - 10https://gerrit.wikimedia.org/r/584590 (https://phabricator.wikimedia.org/T248848) (owner: 10Vgutierrez) [12:56:35] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [12:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:41] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2005.codfw.wmnet` - cp2005.codfw.wmnet (**PASS**) -... [12:56:56] (03CR) 10ArielGlenn: "I want to keep the old dumpsdata recipe around for new hosts, since it should format the data array. The recipe in use for hosts currently" [puppet] - 10https://gerrit.wikimedia.org/r/584559 (owner: 10Filippo Giunchedi) [12:58:00] hmm *nod*. maybe it is fine to break cumin if it doesn't break the rest of the puppet master. kind of doubt they rely on cumin [12:59:07] that puppetmaster is also the test env for cumin, but yeah I don't need that on jessie anymore [12:59:11] I need the new one on buster [13:00:23] make a new one with buster in the same project and go ahead with the merge? 
[13:00:52] (03CR) 10Filippo Giunchedi: prometheus: add http redirects for the default instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/584545 (owner: 10Filippo Giunchedi) [13:00:54] (03PS2) 10Filippo Giunchedi: prometheus: add http redirects for the default instance [puppet] - 10https://gerrit.wikimedia.org/r/584545 [13:01:04] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add http redirects for the default instance [puppet] - 10https://gerrit.wikimedia.org/r/584545 (owner: 10Filippo Giunchedi) [13:01:28] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: break rules into logical groups [puppet] - 10https://gerrit.wikimedia.org/r/584541 (owner: 10Filippo Giunchedi) [13:02:00] (03PS1) 10Vgutierrez: Remove cp2005 entries [dns] - 10https://gerrit.wikimedia.org/r/584593 (https://phabricator.wikimedia.org/T248848) [13:02:08] (03PS4) 10Dzahn: site/DHCP: remove cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/580087 (https://phabricator.wikimedia.org/T229586) [13:02:52] (03CR) 10Vgutierrez: [C: 03+2] Remove cp2005 entries [dns] - 10https://gerrit.wikimedia.org/r/584593 (https://phabricator.wikimedia.org/T248848) (owner: 10Vgutierrez) [13:03:43] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 (10Vgutierrez) a:05Vgutierrez→03Papaul [13:04:38] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [13:06:50] (03PS2) 10Dzahn: remove production IPs for cp1099 [dns] - 10https://gerrit.wikimedia.org/r/580089 (https://phabricator.wikimedia.org/T229586) [13:06:50] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:29] (03PS1) 10Vgutierrez: site: Reimage cp2032 as cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/584594 (https://phabricator.wikimedia.org/T248816) [13:07:53] !log 
dzahn@cumin1001 START - Cookbook sre.hosts.decommission [13:07:53] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [13:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:20] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:50] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2031.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2031.codfw.wmnet'] ` [13:12:03] the decom failed for cp1099 because it has already been run before i guess [13:12:28] confirmed it's gone from icinga and debmonitor. changed state in netbox to decom [13:12:54] this was in some ghost state.. removing IPs [13:13:14] (03CR) 10Jcrespo: "None of these are used by me or Manuel, to the best of my knowledge." 
[puppet] - 10https://gerrit.wikimedia.org/r/584559 (owner: 10Filippo Giunchedi) [13:13:22] (03CR) 10Dzahn: [C: 03+2] remove production IPs for cp1099 [dns] - 10https://gerrit.wikimedia.org/r/580089 (https://phabricator.wikimedia.org/T229586) (owner: 10Dzahn) [13:14:28] mutante: https://phabricator.wikimedia.org/T229586#5827557 [13:15:00] volans: ah:) yes, thanks [13:15:29] also removing the mgmt DNS entry but keeping the asset tag mgmt [13:15:58] (03PS1) 10Jbond: tomcatversion: correct tomcat version to 9.0.30 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/584595 (https://phabricator.wikimedia.org/T246010) [13:18:26] (03CR) 10Jcrespo: [C: 04-1] "Actually, raid1-lvm-ext4-srv-plus-hwraid.cfg is in use- see comments on ff9d6af1ac1f07d486056e917c4facb55650a9ea It will be used again soo" [puppet] - 10https://gerrit.wikimedia.org/r/584559 (owner: 10Filippo Giunchedi) [13:19:48] (03PS2) 10Dzahn: remove mgmt IPs for cp1099 [dns] - 10https://gerrit.wikimedia.org/r/580091 (https://phabricator.wikimedia.org/T229586) [13:20:25] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/584595 (https://phabricator.wikimedia.org/T246010) (owner: 10Jbond) [13:20:48] ok, that was strange. 
mgmt IP was removed but production IP was not [13:21:00] oh well, done [13:21:13] (03CR) 10Jbond: [V: 03+2 C: 03+2] tomcatversion: correct tomcat version to 9.0.30 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/584595 (https://phabricator.wikimedia.org/T246010) (owner: 10Jbond) [13:21:19] (03PS1) 10Muehlenhoff: Temporarily disable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/584598 [13:22:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/584598 (owner: 10Muehlenhoff) [13:23:33] (03PS1) 10Jcrespo: partman: Reference bacula recipe (unused to prevent accidental formatting) [puppet] - 10https://gerrit.wikimedia.org/r/584599 (https://phabricator.wikimedia.org/T138562) [13:23:35] (03Abandoned) 10Dzahn: remove mgmt IPs for cp1099 [dns] - 10https://gerrit.wikimedia.org/r/580091 (https://phabricator.wikimedia.org/T229586) (owner: 10Dzahn) [13:25:04] (03CR) 10Dzahn: [C: 03+2] ATS: switch iegreview.wm.org to miscweb1002 backend [puppet] - 10https://gerrit.wikimedia.org/r/584583 (https://phabricator.wikimedia.org/T247648) (owner: 10Dzahn) [13:25:12] (03PS2) 10Dzahn: ATS: switch iegreview.wm.org to miscweb1002 backend [puppet] - 10https://gerrit.wikimedia.org/r/584583 (https://phabricator.wikimedia.org/T247648) [13:25:25] (03CR) 10Jcrespo: [C: 04-1] "Let me know if this is agreeable instead: https://gerrit.wikimedia.org/r/c/operations/puppet/+/584599" [puppet] - 10https://gerrit.wikimedia.org/r/584559 (owner: 10Filippo Giunchedi) [13:26:27] (03PS2) 10Jcrespo: partman: Reference bacula recipe (unused to prevent accidental formatting) [puppet] - 10https://gerrit.wikimedia.org/r/584599 (https://phabricator.wikimedia.org/T138562) [13:26:32] (03CR) 10Vgutierrez: [C: 03+1] ATS: unset debug HTTP headers for normal requests [puppet] - 10https://gerrit.wikimedia.org/r/583570 (https://phabricator.wikimedia.org/T210484) (owner: 10Ema) [13:27:18] (03CR) 10Ema: [C: 03+2] ATS: unset debug HTTP headers for normal requests 
[puppet] - 10https://gerrit.wikimedia.org/r/583570 (https://phabricator.wikimedia.org/T210484) (owner: 10Ema) [13:27:28] (03PS15) 10L0st3xpl0r3r: transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) [13:28:28] (03PS16) 10Jcrespo: transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [13:28:35] (03PS3) 10CDanis: phased rollout of sensible flow-table-sizes [homer/public] - 10https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) [13:28:40] (03CR) 10Jcrespo: "Forcing jenkins recheck." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [13:28:50] (03CR) 10CDanis: phased rollout of sensible flow-table-sizes (036 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) (owner: 10CDanis) [13:28:52] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [13:28:57] (03CR) 10Marostegui: [C: 03+1] "Same as the one we have for db999" [puppet] - 10https://gerrit.wikimedia.org/r/584599 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [13:29:23] (03CR) 10Muehlenhoff: [C: 03+2] Temporarily disable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/584598 (owner: 10Muehlenhoff) [13:29:30] (03PS2) 10Muehlenhoff: Temporarily disable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/584598 [13:31:07] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2032 as cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/584594 (https://phabricator.wikimedia.org/T248816) (owner: 
10Vgutierrez) [13:31:23] (03PS2) 10Vgutierrez: site: Reimage cp2032 as cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/584594 (https://phabricator.wikimedia.org/T248816) [13:31:54] (03CR) 10Filippo Giunchedi: prometheus: add recording rules for disk activity (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/584542 (owner: 10Filippo Giunchedi) [13:32:03] (03PS2) 10Filippo Giunchedi: prometheus: add recording rules for disk activity [puppet] - 10https://gerrit.wikimedia.org/r/584542 [13:32:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ssh-key-ldap-lookup: port to python3 [puppet] - 10https://gerrit.wikimedia.org/r/584211 (owner: 10Andrew Bogott) [13:32:45] (03CR) 10Jcrespo: "> Patch Set 16: Verified-1" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [13:32:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] neutron: enable l3_agent_only_dmz_cidr_hack in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/584188 (https://phabricator.wikimedia.org/T247505) (owner: 10Andrew Bogott) [13:33:27] vgutierrez: feel free to "multiple" [13:33:30] mutante: may I merge Dzahn: ATS: switch iegreview.wm.org to miscweb1002 backend (5f48a968c5)? :) [13:33:31] ack [13:33:48] yea, i'll also just let cp hosts run puppet naturally [13:33:49] (03PS3) 10Jcrespo: partman: Reference bacula recipe (unused to prevent accidental formatting) [puppet] - 10https://gerrit.wikimedia.org/r/584599 (https://phabricator.wikimedia.org/T138562) [13:34:00] ema: ^^ another remap config change for ats-backend [13:34:45] ema: that and 584581 as well today [13:34:52] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2032.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... 
[13:34:54] vgutierrez: should i ping each time i do that? [13:35:20] yeah.. the other one segfaulted ats-backend in two servers [13:35:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge-k8s: mount /var/lib/docker on appropriate volume [puppet] - 10https://gerrit.wikimedia.org/r/584061 (https://phabricator.wikimedia.org/T248702) (owner: 10Bstorm) [13:35:27] that's why I'm pinging ema about it [13:35:51] oooh.. but not because there was wrong syntax? [13:35:56] nope mutante [13:35:58] ok [13:36:17] it's a bug [13:37:27] ok. also i stopped touching any varnish files. though sometimes text.yaml still pops up [13:38:49] (03CR) 10Kosta Harlan: [C: 03+1] Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584579 (https://phabricator.wikimedia.org/T248844) (owner: 10Gergő Tisza) [13:38:51] (03PS4) 10Arturo Borrero Gonzalez: Add support for redirecting to toolforge.org [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [13:39:25] (03CR) 10jerkins-bot: [V: 04-1] Add support for redirecting to toolforge.org [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [13:40:28] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:41:40] 10Operations, 10serviceops, 10Patch-For-Review: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (10Dzahn) [13:41:43] (03PS5) 10Arturo Borrero Gonzalez: Add support for redirecting to toolforge.org [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [13:42:12] (03CR) 10jerkins-bot: [V: 04-1] Add support for 
redirecting to toolforge.org [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [13:42:20] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/584559 (owner: 10Filippo Giunchedi) [13:43:23] (03PS2) 10Giuseppe Lavagetto: Update envoy, add ability to define an idle timeout [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580906 [13:44:28] (03CR) 10CDanis: [C: 03+1] prometheus: add recording rules for disk activity (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/584542 (owner: 10Filippo Giunchedi) [13:44:44] (03PS6) 10Arturo Borrero Gonzalez: Add support for redirecting to toolforge.org [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [13:44:55] (03CR) 10Jcrespo: [C: 03+2] partman: Reference bacula recipe (unused to prevent accidental formatting) [puppet] - 10https://gerrit.wikimedia.org/r/584599 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [13:45:46] !log pool cp2031 - T248816 [13:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:52] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 [13:46:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "This is ready for review & merge." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [13:47:14] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (10Vgutierrez) [13:47:52] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [13:49:11] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/584559 (owner: 10Filippo Giunchedi) [13:49:31] (03PS17) 10L0st3xpl0r3r: transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) [13:51:03] (03PS2) 10Filippo Giunchedi: install_server: add notes for dumps/backup recipes [puppet] - 10https://gerrit.wikimedia.org/r/584559 (https://phabricator.wikimedia.org/T156955) [13:51:49] (03CR) 10L0st3xpl0r3r: "> Patch Set 16:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [13:52:10] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add recording rules for disk activity [puppet] - 10https://gerrit.wikimedia.org/r/584542 (owner: 10Filippo Giunchedi) [13:52:38] (03CR) 10ArielGlenn: [C: 03+1] "Good for me for the dumps-related change." 
[puppet] - 10https://gerrit.wikimedia.org/r/584559 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [13:52:44] (03CR) 10L0st3xpl0r3r: "> Patch Set 17:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [13:53:16] (03PS1) 10Vgutierrez: site,install_server: Decommission cp2006 [puppet] - 10https://gerrit.wikimedia.org/r/584605 (https://phabricator.wikimedia.org/T248856) [13:53:51] (03CR) 10Vgutierrez: [C: 03+2] gerrit::proxy: Switch to strong SSL settings [puppet] - 10https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: 10Alex Monk) [13:55:12] oh, nice [13:55:34] (03PS2) 10Ema: cache: stop sending X-Varnish [puppet] - 10https://gerrit.wikimedia.org/r/583942 (https://phabricator.wikimedia.org/T210484) [13:56:02] paladox: ^ gerrit strong SSL patch merged [13:56:14] PROBLEM - traffic_server backend process restarted on cp1081 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqiad+prometheus/ops&var-instance=cp1081&var-layer=backend [13:57:12] (03CR) 10Jcrespo: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/584559 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [13:57:54] (03PS1) 10Dzahn: switch webserver-misc-apps discovery record to miscweb1002 [dns] - 10https://gerrit.wikimedia.org/r/584606 (https://phabricator.wikimedia.org/T247648) [13:57:58] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:41] (03PS18) 10Jcrespo: transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [13:59:49] (03CR) 10Jcrespo: "recheck" 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:00:23] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2032.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2032.codfw.wmnet'] ` [14:01:05] (03PS1) 10Dzahn: ATS: use discovery name instead of miscweb again after migration [puppet] - 10https://gerrit.wikimedia.org/r/584607 (https://phabricator.wikimedia.org/T247648) [14:01:13] !log depool & decommission cp2006 - T248856 [14:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:19] T248856: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 [14:01:19] PROBLEM - traffic_server backend process restarted on cp2023 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2023&var-layer=backend [14:01:24] ema: ^^ [14:02:09] so far all the crashes have been in codfw? [14:02:20] vgutierrez: mostly, yeah [14:02:33] cp1081 also crashed, but that's the good old tslua reload problem [14:02:38] ack [14:02:46] it looks like the same [14:03:00] the lua plugin is involved in this stack trace as well.. 
[14:03:10] (03CR) 10Jbond: phased rollout of sensible flow-table-sizes (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) (owner: 10CDanis) [14:03:19] 10Operations, 10Gerrit, 10Traffic, 10HTTPS, 10Security: Harden apache SSL/TLS settings for gerrit.wikimedia.org - https://phabricator.wikimedia.org/T221499 (10Reedy) [14:03:20] the same or related at least [14:03:52] vgutierrez: I'm not sure, there's no lua panic in the segfault case [14:03:59] 10Operations, 10Gerrit, 10Traffic, 10HTTPS, 10Security: Harden apache SSL/TLS settings for gerrit.wikimedia.org - https://phabricator.wikimedia.org/T221499 (10Reedy) 05Open→03Resolved a:03Krenair [14:04:40] (03CR) 10Ema: [C: 03+2] cache: stop sending X-Varnish [puppet] - 10https://gerrit.wikimedia.org/r/583942 (https://phabricator.wikimedia.org/T210484) (owner: 10Ema) [14:04:42] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:05:30] (03CR) 10Jcrespo: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/584559 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [14:05:57] ema: nope, but it's crashing on ts_lua.cc @ TSRemapDeleteInstance [14:06:05] mutante thanks! 
[14:06:22] ema: that's triggered on config reload [14:07:06] see https://github.com/apache/trafficserver/blob/8.0.6/plugins/lua/ts_lua.c#L167-L176 [14:07:36] vgutierrez: it's not always crashing there tho, see https://phabricator.wikimedia.org/P10814 [14:07:51] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:08:19] vgutierrez: so yes there's definitely something wrong with tslua upon reloads, but I'm not convinced that the panics and the segfaults are the same issue [14:08:28] so sometimes crashes on the stop part of the tslua and some times on the start part [14:08:31] funny :) [14:09:03] ema: I'm wondering if this it's related with the global reload option [14:09:12] *related to [14:10:18] our lovely --enable-reload [14:10:39] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:10:53] (03CR) 10Krinkle: Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584579 (https://phabricator.wikimedia.org/T248844) (owner: 10Gergő Tisza) [14:11:29] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 22393 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:11:50] vgutierrez: possibly! 
[14:13:17] 10Operations, 10fundraising-tech-ops, 10observability, 10User-fgiunchedi: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi With https://gerrit.wikimedia.org/r/580985 merged I'm resolving this task since che... [14:14:26] (03CR) 10Vgutierrez: [C: 03+2] site,install_server: Decommission cp2006 [puppet] - 10https://gerrit.wikimedia.org/r/584605 (https://phabricator.wikimedia.org/T248856) (owner: 10Vgutierrez) [14:14:50] (03CR) 10RLazarus: [C: 04-1] "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/584020 (owner: 10RLazarus) [14:15:49] (03PS19) 10Jcrespo: transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:15:58] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Update to 5.0.41 templates [software/otrs] - 10https://gerrit.wikimedia.org/r/584549 (owner: 10Alexandros Kosiaris) [14:16:17] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:17:11] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [14:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:43] (03PS20) 10Jcrespo: transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:17:49] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:55] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2006.codfw.wmnet - 
https://phabricator.wikimedia.org/T248856 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2006.codfw.wmnet` - cp2006.codfw.wmnet (**PASS**) -... [14:19:08] 10Operations, 10observability: Move prometheus entry point off port 80 - https://phabricator.wikimedia.org/T152445 (10fgiunchedi) 05Open→03Invalid We're moving Prometheus on its own dedicated hosts everywhere, I see no reason not to leave the current entry point as is now (also we moved to apache in the me... [14:19:44] (03PS21) 10L0st3xpl0r3r: transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) [14:20:17] (03CR) 10Jcrespo: "yay." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:21:01] (03CR) 10L0st3xpl0r3r: "> Patch Set 18:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:22:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs: Add 2 new CPAN dependencies [puppet] - 10https://gerrit.wikimedia.org/r/584557 (https://phabricator.wikimedia.org/T248814) (owner: 10Alexandros Kosiaris) [14:22:38] (03PS1) 10Vgutierrez: Remove cp2006 entries [dns] - 10https://gerrit.wikimedia.org/r/584608 (https://phabricator.wikimedia.org/T248856) [14:22:45] (03PS22) 10Jcrespo: transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:23:00] (03CR) 10Jcrespo: "recheck" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:23:04] !log pool cp2032 - T248816 [14:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:13] T248816: Replace cp20[01-26] with 
cp20[27-42] - https://phabricator.wikimedia.org/T248816 [14:24:04] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10Dzahn) a:05Jclark-ctr→03Cmjohnson Hi Chris, can this be fixed from remote? The host is in an odd state. It exists but we can't SSH to it or use install... [14:24:15] (03CR) 10Vgutierrez: [C: 03+2] Remove cp2006 entries [dns] - 10https://gerrit.wikimedia.org/r/584608 (https://phabricator.wikimedia.org/T248856) (owner: 10Vgutierrez) [14:25:12] (03CR) 10L0st3xpl0r3r: "> Patch Set 20:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:25:27] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (10Vgutierrez) a:05Vgutierrez→03Papaul [14:25:57] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [14:26:26] (03CR) 10L0st3xpl0r3r: "> Patch Set 22:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:29:54] (03PS1) 10Gergő Tisza: Document the process of updating dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584610 [14:30:15] (03PS1) 10Vgutierrez: site: Reimage cp2033 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/584611 (https://phabricator.wikimedia.org/T248816) [14:30:44] (03PS1) 10Dzahn: add planet1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/584612 (https://phabricator.wikimedia.org/T247651) [14:33:58] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2033 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/584611 (https://phabricator.wikimedia.org/T248816) (owner: 10Vgutierrez) [14:34:15] 10Operations, 10vm-requests: Site: EQIAD/CODFW 2 VM request for planet - 
https://phabricator.wikimedia.org/T248863 (10Dzahn) [14:34:38] 10Operations, 10serviceops, 10Patch-For-Review: upgrade planet.wikimedia.org backends to buster - https://phabricator.wikimedia.org/T247651 (10Dzahn) [14:34:40] 10Operations, 10vm-requests: Site: EQIAD/CODFW 2 VM request for planet - https://phabricator.wikimedia.org/T248863 (10Dzahn) [14:35:16] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2033.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... [14:36:01] (03PS1) 10Ema: conftool::scripts: add pooled [puppet] - 10https://gerrit.wikimedia.org/r/584613 [14:37:49] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10Vgutierrez) [14:38:40] 10Operations, 10Analytics, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) [14:38:58] XioNoX, cdanis --^ [14:39:25] nothing terribly urgent but we should do it next quarter if possible [14:41:58] elukey: ack, sounds good [14:42:18] elukey: would we do the adding of router + interface data before this? (not sure of the task# for that) [14:42:46] (03PS1) 10Vgutierrez: site,install_server: Decommission cp2008 [puppet] - 10https://gerrit.wikimedia.org/r/584614 (https://phabricator.wikimedia.org/T248864) [14:42:55] cdanis: ah yes I can work on that sooner if needed, but data should already be in hive [14:43:05] (03PS23) 10Jcrespo: transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:44:09] (03CR) 10Jcrespo: "What would you think of this-- it is your same patch, but with reduced cyclomatic complexity?" 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:46:47] 10Operations, 10Patch-For-Review, 10Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (10RLazarus) Just snapshot1006 left, it's on my list for this morning. [14:50:13] 10Operations, 10Patch-For-Review, 10Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (10RLazarus) 05Open→03Resolved And done. [14:50:30] (03CR) 10Bstorm: [C: 03+2] "Oops, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/584102 (https://phabricator.wikimedia.org/T248731) (owner: 10Zhuyifei1999) [14:52:43] (03CR) 10L0st3xpl0r3r: "> Patch Set 23:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [14:53:52] !log depool & decommission cp2008 - T248864 [14:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:57] T248864: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 [14:54:08] (03CR) 10Vgutierrez: [C: 03+2] site,install_server: Decommission cp2008 [puppet] - 10https://gerrit.wikimedia.org/r/584614 (https://phabricator.wikimedia.org/T248864) (owner: 10Vgutierrez) [14:55:44] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10AMooney) a:03holger.knust [14:55:59] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [14:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:55] (03CR) 10RLazarus: [C: 03+1] check_opcache: Use the number of scripts to determine threshold (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/583906 
(owner: 10Giuseppe Lavagetto) [14:57:14] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [14:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:48] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:53] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2008.codfw.wmnet` - cp2008.codfw.wmnet (**PASS**) -... [14:57:56] (03PS1) 10Zoranzoki21: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) [14:58:26] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:22] (03PS1) 10Vgutierrez: Remove cp2008 entries [dns] - 10https://gerrit.wikimedia.org/r/584616 (https://phabricator.wikimedia.org/T248864) [14:59:38] (03PS2) 10Zoranzoki21: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) [14:59:51] (03PS2) 10Ema: conftool::scripts: add pooled [puppet] - 10https://gerrit.wikimedia.org/r/584613 (https://phabricator.wikimedia.org/T248067) [15:00:19] (03CR) 10Alexandros Kosiaris: "@ottomata, yes, I think we are finally ready. For this and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/562792/. 
Both need some " [puppet] - 10https://gerrit.wikimedia.org/r/562810 (https://phabricator.wikimedia.org/T241073) (owner: 10Alexandros Kosiaris) [15:01:01] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2033.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2033.codfw.wmnet'] ` [15:03:13] (03CR) 10Jcrespo: [C: 03+1] "Good work!" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [15:03:59] elukey: ah cool, no worries then [15:04:27] (03CR) 10Vgutierrez: [C: 03+2] Remove cp2008 entries [dns] - 10https://gerrit.wikimedia.org/r/584616 (https://phabricator.wikimedia.org/T248864) (owner: 10Vgutierrez) [15:05:50] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10Vgutierrez) a:05Vgutierrez→03Papaul [15:06:15] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [15:06:32] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2001.codfw.wmnet - https://phabricator.wikimedia.org/T248815 (10Papaul) ` [edit interfaces interface-range disabled] member xe-7/0/46 { ... } + member xe-2/0/3; [edit interfaces] - xe-2/0/3 { - description cp20... [15:06:57] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2001.codfw.wmnet - https://phabricator.wikimedia.org/T248815 (10Papaul) [15:07:18] (03CR) 10Hashar: [C: 03+1] "Thx :)" [puppet] - 10https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: 10Alex Monk) [15:08:32] (03CR) 10RhinosF1: [C: 03+1] "LGTM, we should update the Configuration files help guide on Wikitech per /Ve-enable is my user space as well." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/584610 (owner: 10Gergő Tisza) [15:08:48] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2002.codfw.wmnet - https://phabricator.wikimedia.org/T248818 (10Papaul) ` [edit interfaces interface-range disabled] member xe-2/0/3 { ... } + member xe-2/0/5; [edit interfaces] - xe-2/0/5 { - description cp200... [15:08:51] tgr: ^ [15:09:09] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2002.codfw.wmnet - https://phabricator.wikimedia.org/T248818 (10Papaul) [15:09:37] 10Operations, 10MediaWiki-Parser, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Wikimedia-Incident: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (10AMooney) a:03nnikkhoui [15:10:25] (03CR) 10Andrew Bogott: [C: 03+2] ssh-key-ldap-lookup: port to python3 [puppet] - 10https://gerrit.wikimedia.org/r/584211 (owner: 10Andrew Bogott) [15:11:30] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (10Papaul) ` [edit interfaces interface-range disabled] member xe-2/0/5 { ... } + member xe-7/0/3; [edit interfaces] - xe-7/0/3 { - description cp2004; - enab... [15:11:57] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (10Papaul) [15:12:09] (03CR) 10Marostegui: [C: 03+1] "Thank you!!" 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [15:13:39] (03PS1) 10KartikMistry: Update cxserver to 2020-03-30-145349-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/584618 (https://phabricator.wikimedia.org/T248578) [15:14:40] (03CR) 10Jcrespo: [C: 03+2] transfer.py: Convert return for run() from int to list [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/583960 (https://phabricator.wikimedia.org/T248661) (owner: 10L0st3xpl0r3r) [15:16:27] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 (10Papaul) ` [edit interfaces interface-range disabled] member xe-7/0/3 { ... } + member xe-7/0/5; [edit interfaces] - xe-7/0/5{ - description cp2005; - enabl... [15:16:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] Make configuration of envoy a ConfigMap (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/582777 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:16:53] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 (10Papaul) [15:24:56] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10Papaul) @Vgutierrez this is still active in Netbox [15:25:20] !log add icinga 2h downtime and soft reset iDRAC on labstore1005.mgmt.eqiad.wmnet T247965 [15:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:25] T247965: labstore1005 mgmt console unreachable via SSH - https://phabricator.wikimedia.org/T247965 [15:25:37] (03PS1) 10Jcrespo: transfer.py: Upgrade codebase to latest version on HEAD [puppet] - 10https://gerrit.wikimedia.org/r/584620 [15:26:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] logspam-watch: don't use fatal.log; shorten names [puppet] - 
10https://gerrit.wikimedia.org/r/582871 (https://phabricator.wikimedia.org/T248337) (owner: 10Brennen Bearnes) [15:28:12] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (10Papaul) ` [edit interfaces interface-range disabled] member xe-7/0/5 { ... } + member xe-7/0/6; [edit interfaces] - xe-7/0/6 { - description cp2006; - enab... [15:28:37] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10Vgutierrez) hmmm @Volans apparently cp2008 is still active in netbox but the cookbook logged `Set Netbox status to Decommissioning` [15:28:56] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (10Papaul) [15:30:21] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10Vgutierrez) @Papaul in https://netbox.wikimedia.org/dcim/devices/679/ is marked as "Decommissioning" not "active" [15:32:31] !log pool cp2033 - T248816 [15:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:38] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 [15:34:08] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [15:35:26] 10Operations, 10Puppet, 10Traffic, 10serviceops: Puppet systemd::mask is an anti pattern that has unwanted side effect - https://phabricator.wikimedia.org/T233839 (10hashar) 05Stalled→03Invalid [15:35:31] (03PS3) 10Filippo Giunchedi: install_server: add notes for dumps/backup recipes [puppet] - 10https://gerrit.wikimedia.org/r/584559 (https://phabricator.wikimedia.org/T156955) [15:35:35] (03CR) 10Filippo Giunchedi: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/584559 
(https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:35:47] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10Papaul) ` [edit interfaces interface-range disabled] member xe-2/0/46 { ... } + member xe-2/0/5; [edit interfaces] - xe-2/0/5 { - description cp2008; - ena... [15:36:14] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10Papaul) [15:40:01] 10Operations, 10Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10herron) [15:40:45] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] README.md: update docker-pkg command line [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580124 (owner: 10Hashar) [15:40:51] (03CR) 10Jcrespo: [C: 03+1] install_server: add notes for dumps/backup recipes [puppet] - 10https://gerrit.wikimedia.org/r/584559 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:41:41] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2008.codfw.wmnet - https://phabricator.wikimedia.org/T248864 (10Papaul) [15:42:27] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (10Papaul) [15:43:51] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 (10Papaul) [15:45:40] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (10Papaul) [15:48:11] 10Operations, 10Analytics, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10ayounsi) So the options are: * add a plugin to pmacct (eg. 
https://github.com/pierky/pmacct-to-elasticsearch) * replace pmacct with something that can do HTTP POST * insert something b... [15:50:49] 10Operations: puppet-merge manual locking - https://phabricator.wikimedia.org/T248872 (10CDanis) [15:51:06] 10Operations: puppet-merge lockout/tagout - https://phabricator.wikimedia.org/T248872 (10CDanis) p:05Triage→03Medium [15:51:08] 10Operations, 10Analytics, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) Hm, if you just want to get the Refine side working, all you really need is for the data in Kafka to look right. EventGate gets you schema validation (and maybe a couple of... [15:53:47] (03PS1) 10Andrew Bogott: ssh-key-ldap-lookup: keep a python2 version for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/584624 [15:57:26] (03CR) 10jerkins-bot: [V: 04-1] ssh-key-ldap-lookup: keep a python2 version for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/584624 (owner: 10Andrew Bogott) [15:58:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ssh-key-ldap-lookup: keep a python2 version for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/584624 (owner: 10Andrew Bogott) [15:59:21] (03PS2) 10Andrew Bogott: ssh-key-ldap-lookup: keep a python2 version for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/584624 [15:59:53] 10Operations, 10Analytics, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10elukey) >>! In T248865#6011075, @Ottomata wrote: > Hm, if you just want to get the Refine side working, all you really need is for the data in Kafka to look right. EventGate gets you... 
[16:03:58] (03CR) 10Andrew Bogott: [C: 03+2] ssh-key-ldap-lookup: keep a python2 version for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/584624 (owner: 10Andrew Bogott) [16:06:41] 10Puppet, 10Patch-For-Review, 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), 10User-brennen: logspam-watch: fatal.log no longer exists - https://phabricator.wikimedia.org/T248337 (10brennen) 05Open→03Resolved [16:17:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Correct me if I am wrong, but this requires buster, right? systemd on stretch+buster supports it but jessie doesn't, so to start with, pro" [puppet] - 10https://gerrit.wikimedia.org/r/584553 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [16:22:15] (03CR) 10Bstorm: [C: 03+2] toolforge: remove the old docker builder code [puppet] - 10https://gerrit.wikimedia.org/r/584059 (https://phabricator.wikimedia.org/T248703) (owner: 10Bstorm) [16:32:31] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:39] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:55] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:03] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:11] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:11] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:13] this is eventstreams --^ [16:33:21] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:29] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:33] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:36] but that should be on k8s [16:33:45] so this is probably a cleanup? [16:33:53] akosiaris: --^ [16:34:01] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:34] yes [16:36:50] elukey: yup, but it's weird that it's alerting, it shouldn't, /me looking [16:37:18] Code Deployment Office Hour is happening in 25 minutes in #wikimedia-office on IRC [16:38:01] (03CR) 10Jforrester: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/583906 (owner: 10Giuseppe Lavagetto) [16:40:07] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:15] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:23] ah, forgot to run systemctl reset-failed [16:40:31] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:39] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:41] despite having removed the unit file AND ran systemctl daemon-reload [16:40:49] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:49] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:50] super, just wanted to triple check :) [16:40:51] it still keeps the service in state [16:40:56] (03CR) 10Andrew Bogott: [C: 03+2] neutron: enable l3_agent_only_dmz_cidr_hack in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/584188 (https://phabricator.wikimedia.org/T247505) (owner: 10Andrew Bogott) [16:40:57] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:05] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:09] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:37] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:05] (03CR) 10Brennen Bearnes: [C: 03+1] "It's been a while since I looked at CSP, but this seems reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/582604 (https://phabricator.wikimedia.org/T245658) (owner: 10Brian Wolff) [16:46:01] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10Cmjohnson) [16:49:20] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10elukey) [16:49:26] (03CR) 10Jforrester: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/583906 (owner: 10Giuseppe Lavagetto) [16:49:32] (03CR) 10Jforrester: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/584020 (owner: 10RLazarus) [16:49:43] (03PS1) 10Hnowlan: changeprop: Make service features toggles rather than comments [deployment-charts] - 10https://gerrit.wikimedia.org/r/584637 (https://phabricator.wikimedia.org/T248677) [16:49:44] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10elukey) After T245810 we don't really need a custom partman recipe anymore, let's use partman/raid10-4dev.cfg (already configured in puppet for... 
[16:51:39] (03PS1) 10Jbond: tomcat: create new tomcat module intended for use with apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/584638 (https://phabricator.wikimedia.org/T233950) [16:53:01] (03CR) 10jerkins-bot: [V: 04-1] tomcat: create new tomcat module intended for use with apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/584638 (https://phabricator.wikimedia.org/T233950) (owner: 10Jbond) [16:53:43] (03CR) 10Jforrester: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/584020 (owner: 10RLazarus) [16:54:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/584637 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [16:54:52] James_F: thank you! [16:55:09] rlazarus: My pleasure. [16:55:22] (03PS2) 10Jbond: tomcat: create new tomcat module intended for use with apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/584638 (https://phabricator.wikimedia.org/T233950) [17:00:04] gehel and onimisionipe: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200330T1700). [17:02:30] (03CR) 10Ppchelko: "Did you test that if you enable all of these change-prop actually starts?" 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/584637 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [17:05:15] (03CR) 10Jcrespo: [C: 03+2] transfer.py: Upgrade codebase to latest version on HEAD [puppet] - 10https://gerrit.wikimedia.org/r/584620 (owner: 10Jcrespo) [17:14:17] (03CR) 10Krinkle: Document the process of updating dblists (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584610 (owner: 10Gergő Tisza) [17:15:27] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:17:23] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:21:56] (03CR) 10Herron: "LGTM, pending follow-up re: profile::kibana::httpd_proxy vs profile::kibana_httpd_proxy. The former is preferable to me as well, but msty" [puppet] - 10https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [17:23:09] (03PS3) 10Cparle: Enable WikibaseQualityConstraints on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581678 (https://phabricator.wikimedia.org/T248117) [17:28:31] 10Operations, 10Cloud-VPS, 10Epic, 10IPv6, 10cloud-services-team (Kanban): Enable IPv6 on CloudVPS - https://phabricator.wikimedia.org/T37947 (10aborrero) CloudVPS now uses a version of openstack that fully supports IPv6. 
Research/PoC work on IPv6 can be seen at {T245495} [17:32:41] 10Operations, 10Graphoid, 10Code-Stewardship-Reviews, 10Release-Engineering-Team (Code Health), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10kaldari) @Milimetric - If you are working on migrating Graphoid to Node 10, please create a Phabricator task to tra... [17:41:22] 10Operations, 10Performance-Team: MW Memcached get hit ratio trend over the past months - https://phabricator.wikimedia.org/T248890 (10elukey) [17:44:09] 10Operations, 10Analytics, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) Indeed I need some better docs on this, and will be writing them as part of the EventLogging migration. Devs will need to be able to figure this out easily. But, some start... [17:45:21] (03PS1) 10Ppchelko: Remove outdated PCS endpoint references [deployment-charts] - 10https://gerrit.wikimedia.org/r/584660 [17:45:51] (03CR) 10Jforrester: [C: 04-1] "Yeah, what Timo said." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584610 (owner: 10Gergő Tisza) [17:46:22] (03CR) 10RLazarus: [C: 03+2] systemd: Replace the Datetime regex with a call to systemd-analyze. [puppet] - 10https://gerrit.wikimedia.org/r/584020 (owner: 10RLazarus) [17:48:47] (03CR) 10Jforrester: "recheck" [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/580944 (owner: 10Ssingh) [17:52:26] (03CR) 10Ori.livneh: "Thanks Chris! Are you able to merge this?" [puppet] - 10https://gerrit.wikimedia.org/r/577652 (owner: 10Ori.livneh) [17:53:32] (03PS1) 10Cmjohnson: Add an-druid/druid100[78] to dhcp and netboot.cfg files [puppet] - 10https://gerrit.wikimedia.org/r/584666 (https://phabricator.wikimedia.org/T245569) [17:57:18] James_F: thanks! 
completed successfully [17:58:09] (PS1) Ppchelko: Eventgate-main: add mediawiki/page-suppress stream config [deployment-charts] - https://gerrit.wikimedia.org/r/584667 (https://phabricator.wikimedia.org/T242025) [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200330T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:07] (CR) Ottomata: [C: +1] Eventgate-main: add mediawiki/page-suppress stream config [deployment-charts] - https://gerrit.wikimedia.org/r/584667 (https://phabricator.wikimedia.org/T242025) (owner: Ppchelko) [18:00:27] (CR) Cmjohnson: [C: +2] Add an-druid/druid100[78] to dhcp and netboot.cfg files [puppet] - https://gerrit.wikimedia.org/r/584666 (https://phabricator.wikimedia.org/T245569) (owner: Cmjohnson) [18:05:00] (PS2) Ssingh: Update `install_requires' in setup.py [software/censorship-monitoring] - https://gerrit.wikimedia.org/r/580944 [18:06:35] Operations, DBA, Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (bd808) >>! In T231520#6009325, @Marostegui wrote: > I have also run some queries via Quarry and I have seen them arriving correctly to labsdb1011 via the n...
[18:08:10] Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (Cmjohnson) [18:11:03] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:11:42] (CR) Ssingh: [C: +2] "Merging without additional review as only the CI environment was updated and no functionality has changed." [software/censorship-monitoring] - https://gerrit.wikimedia.org/r/580944 (owner: Ssingh) [18:12:36] (PS1) Ppchelko: Changeprop: Listen to mediawiki.page-suppress topic [deployment-charts] - https://gerrit.wikimedia.org/r/584672 (https://phabricator.wikimedia.org/T242025) [18:14:47] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:21:09] (PS1) Cmjohnson: Updating dhchp file for kafka-jumbo100[789] to reflect correct eth port [puppet] - https://gerrit.wikimedia.org/r/584673 (https://phabricator.wikimedia.org/T244506) [18:22:19] (PS2) Cmjohnson: Updating dhchp file for kafka-jumbo100[789] to reflect correct eth port [puppet] - https://gerrit.wikimedia.org/r/584673 (https://phabricator.wikimedia.org/T244506) [18:23:15] (CR) Bstorm: [C: +2] toolforge-k8s: mount /var/lib/docker on appropriate volume [puppet] - https://gerrit.wikimedia.org/r/584061 (https://phabricator.wikimedia.org/T248702) (owner: Bstorm) [18:23:42] (CR) Cmjohnson: [C: +2] Updating dhchp file for kafka-jumbo100[789] to reflect correct eth port [puppet] - https://gerrit.wikimedia.org/r/584673
(https://phabricator.wikimedia.org/T244506) (owner: Cmjohnson) [18:28:42] (Restored) Lucas Werkmeister (WMDE): fatalmonitor: exec watch [puppet] - https://gerrit.wikimedia.org/r/499761 (owner: Lucas Werkmeister (WMDE)) [18:30:36] (PS4) Lucas Werkmeister (WMDE): logspam-watch: exec watch [puppet] - https://gerrit.wikimedia.org/r/499761 [18:31:00] (CR) Lucas Werkmeister (WMDE): "Change resurrected for the replacement script, where the exact same improvement can be made." [puppet] - https://gerrit.wikimedia.org/r/499761 (owner: Lucas Werkmeister (WMDE)) [18:36:47] Operations, ops-codfw, fundraising-tech-ops: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (Dwisehaupt) [18:44:34] Operations, MediaWiki-General, serviceops, Core Platform Team Workboards (Clinic Duty Team), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (holger.knust) [18:50:28] Operations, SRE-Access-Requests, Developer-Advocacy (Apr-Jun 2020): Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 (Aklapper) [18:50:57] (PS1) Aklapper: aklapper: access to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/584676 (https://phabricator.wikimedia.org/T248905) [18:54:18] Operations, Projects-Cleanup, Release-Engineering-Team-TODO, Traffic, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF) @Pchelolo Do you have any insight on the job queue question? Thanks! [18:55:14] Operations, Projects-Cleanup, Release-Engineering-Team-TODO, Traffic, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Pchelolo) > Purge jobs for this wiki from the JobQueue. There's no need to do this step in WMF Kafka-based jobqueue.
[18:59:35] Operations, Projects-Cleanup, Release-Engineering-Team-TODO, Traffic, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF) [19:00:31] Operations, MediaWiki-General, serviceops, Core Platform Team Workboards (Clinic Duty Team), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (holger.knust) Suggestion for #user-notice Mediawiki is up... [19:00:36] Operations, Projects-Cleanup, Release-Engineering-Team-TODO, Traffic, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF) [19:01:24] Operations, Projects-Cleanup, Release-Engineering-Team-TODO, Traffic, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF) [19:16:03] (CR) Hashar: "I have cherry picked it on the integration puppet master. We will see the result tomorrow ;)" [puppet] - https://gerrit.wikimedia.org/r/579602 (owner: Thcipriani) [19:19:03] Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kafka-jumbo1009.eqiad.wmnet ` The log can be fou... [19:19:44] Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kafka-jumbo1008.eqiad.wmnet ` The log can be fou...
[19:20:24] Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kafka-jumbo1007.eqiad.wmnet ` The log can be fou... [19:21:20] Operations, SRE-Access-Requests, Developer-Advocacy (Apr-Jun 2020), Patch-For-Review: Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 (Bmueller) OK from my side! [19:23:17] (PS2) Ottomata: eventstreams: Remove profile [puppet] - https://gerrit.wikimedia.org/r/583077 (https://phabricator.wikimedia.org/T238658) (owner: Alexandros Kosiaris) [19:24:33] (CR) Ottomata: [C: +2] eventstreams: Remove profile [puppet] - https://gerrit.wikimedia.org/r/583077 (https://phabricator.wikimedia.org/T238658) (owner: Alexandros Kosiaris) [19:28:11] (CR) Mholloway: [C: +1] Remove outdated PCS endpoint references [deployment-charts] - https://gerrit.wikimedia.org/r/584660 (owner: Ppchelko) [19:29:58] Operations, ops-eqiad, DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` htmldumper1001.eqiad.wmnet ` The log can be foun... [19:34:59] Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (Ottomata) FYI, EventStreams is fully migrated to k8s and is using Cole's service-runner prometheus exporter code. Dashboard here: https://grafana... [19:35:06] (CR) Hashar: "Ayounsi can you review it please?
I will be happy to detail the logic in a video call ;)" [debs/pynetbox] (debian) - https://gerrit.wikimedia.org/r/553741 (owner: Hashar) [19:39:22] Operations, Wikimedia-Mailing-lists: Creation of three Wikimedia CH mailing lists - https://phabricator.wikimedia.org/T248910 (Ilario) [19:40:25] Operations, Performance-Team: MW Memcached get hit ratio trend over the past months - https://phabricator.wikimedia.org/T248890 (Gilles) a:aaron [19:47:46] (PS11) RLazarus: profile::mediawiki::maintenance: Migrate pagetriage jobs to periodic_job [puppet] - https://gerrit.wikimedia.org/r/582933 (https://phabricator.wikimedia.org/T211250) [19:50:07] Operations, Performance-Team, Traffic: Review socket balancing in ATS/Varnish traffic layers - https://phabricator.wikimedia.org/T248522 (Gilles) a:dpifke [19:53:52] (PS1) Volans: junos: retry when a timeout occurs during commits [software/homer] - https://gerrit.wikimedia.org/r/584689 (https://phabricator.wikimedia.org/T244363) [19:54:44] Operations, Performance-Team: MW Memcached get hit ratio trend over the past months - https://phabricator.wikimedia.org/T248890 (aaron) The timing of this SAL entry suggests some relation: 13:15 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.16/extensions/Wikibase/lib/includes/Store/CachingPropertyIn...
[19:55:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:15] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:16] Operations, Performance-Team, SRE-swift-storage, Traffic, and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Gilles) a:Gilles→dpifke [19:57:06] (PS12) RLazarus: profile::mediawiki::maintenance: Migrate pagetriage jobs to periodic_job [puppet] - https://gerrit.wikimedia.org/r/582933 (https://phabricator.wikimedia.org/T211250) [19:57:29] (PS1) Jbond: profile::tlsproxy::envoy: add missing hiera values [puppet] - https://gerrit.wikimedia.org/r/584690 [19:58:47] (CR) Jbond: [C: +2] profile::tlsproxy::envoy: add missing hiera values [puppet] - https://gerrit.wikimedia.org/r/584690 (owner: Jbond) [20:00:04] halfak and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200330T2000). [20:05:30] (CR) RLazarus: "Updated PCC, now that the new Datetime is merged: https://puppet-compiler.wmflabs.org/compiler1002/21626/" [puppet] - https://gerrit.wikimedia.org/r/582933 (https://phabricator.wikimedia.org/T211250) (owner: RLazarus) [20:09:09] Operations, ops-eqiad, DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['htmldumper1001.eqiad.wmnet'] ` and were **ALL** successful.
[20:17:43] (PS1) Jbond: cloud - puppet: use puppet5 and facter 3 by default [puppet] - https://gerrit.wikimedia.org/r/584696 [20:19:57] Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-jumbo1008.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kafka-jumbo1008.eqiad.wmnet'] ` [20:24:06] Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-jumbo1009.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kafka-jumbo1009.eqiad.wmnet'] ` [20:25:43] (PS1) Jbond: cloude tlsproxy::envoy: remove ~ default [puppet] - https://gerrit.wikimedia.org/r/584699 [20:26:24] Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-jumbo1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kafka-jumbo1007.eqiad.wmnet'] ` [20:26:35] (CR) Jbond: [C: +2] cloude tlsproxy::envoy: remove ~ default [puppet] - https://gerrit.wikimedia.org/r/584699 (owner: Jbond) [20:46:07] Operations, netops, cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (Multichill) BGP is quite a slow protocol, you might want to tweak some timers or combine it with BFD. If BGP is giving you too much hassle, you might wa...
[20:46:35] (CR) Jforrester: ""FAILURE No change detected against the current configuration."" [mediawiki-config] - https://gerrit.wikimedia.org/r/584184 (owner: Gergő Tisza) [20:47:01] (CR) Jforrester: [C: +1] Alphabetize GrowthExperiments settings [mediawiki-config] - https://gerrit.wikimedia.org/r/584132 (owner: Gergő Tisza) [20:47:27] (CR) Jforrester: [C: +1] Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/584579 (https://phabricator.wikimedia.org/T248844) (owner: Gergő Tisza) [20:47:56] jouncebot: next [20:47:56] In 0 hour(s) and 12 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200330T2100) [20:48:03] Meh. [20:48:17] (PS5) Jforrester: Construct wgLogos in CommonSettings so that projects can inherit values [mediawiki-config] - https://gerrit.wikimedia.org/r/583459 [20:48:25] Jdlrobson: Now a good time? [20:48:36] (CR) DannyS712: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/584132 (owner: Gergő Tisza) [20:49:02] (CR) DannyS712: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/584579 (https://phabricator.wikimedia.org/T248844) (owner: Gergő Tisza) [21:00:04] Reedy and sbassett: I, the Bot under the Fountain, allow thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200330T2100).
[21:08:47] Operations, SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T248922 (holger.knust) [21:13:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:15:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:36:18] (PS1) Ssingh: Set UTC as the default timezone for the `until' argument [software/censorship-monitoring] - https://gerrit.wikimedia.org/r/584713 [21:40:28] Operations, Analytics, Analytics-Kanban, EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (stjn) I was wondering why did this start coming up again :-) Alas. [21:41:32] (PS2) Andrew Bogott: Switch Cloud VPS/Toolforge to Puppet 5 / Facter 3 [puppet] - https://gerrit.wikimedia.org/r/583030 (owner: Muehlenhoff) [21:41:37] (CR) Ssingh: [C: +2] "Merging a trivial change (again); no change in core functionality." [software/censorship-monitoring] - https://gerrit.wikimedia.org/r/584713 (owner: Ssingh) [21:42:29] (CR) Andrew Bogott: [C: +2] Switch Cloud VPS/Toolforge to Puppet 5 / Facter 3 [puppet] - https://gerrit.wikimedia.org/r/583030 (owner: Muehlenhoff) [21:58:42] Operations, ops-eqiad, DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet.
- https://phabricator.wikimedia.org/T245567 (Cmjohnson) Open→Resolved @Dzahn I reimaged and am now able to login cmjohnson@Bolts2 ~ % ssh htmldumper1001.eqiad.wmnet Linux htmldumper1001 4.19.... [22:01:50] (PS2) Gergő Tisza: Document the process of updating dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/584610 [22:05:22] (CR) Jforrester: [C: -1] "No double-spaces please." [mediawiki-config] - https://gerrit.wikimedia.org/r/584610 (owner: Gergő Tisza) [22:05:37] tgr: I'll hijack and merge. [22:08:13] (PS3) Jforrester: Sign-post the process of updating dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/584610 (owner: Gergő Tisza) [22:08:36] (CR) Jforrester: [C: +2] Sign-post the process of updating dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/584610 (owner: Gergő Tisza) [22:11:15] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:13:48] (Merged) jenkins-bot: Sign-post the process of updating dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/584610 (owner: Gergő Tisza) [22:16:08] (PS6) Jforrester: Construct wgLogos in CommonSettings so that projects can inherit values [mediawiki-config] - https://gerrit.wikimedia.org/r/583459 [22:16:14] (CR) Jforrester: [C: +2] Construct wgLogos in CommonSettings so that projects can inherit values [mediawiki-config] - https://gerrit.wikimedia.org/r/583459 (owner: Jforrester) [22:16:14] thx [22:16:23] (PS4) Jforrester: Preparation for removal of $wgMobileFrontendLogo [mediawiki-config] - https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: Jdlrobson) [22:21:17] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to the wmf
group for Pita - https://phabricator.wikimedia.org/T247722 (Jpita) >>! In T247722#6009888, @Aklapper wrote: > @Jpita: See the question in T247722#6004734 which is not about `Jpita` but about `Jose pita`. Well, if Jpita is my... [22:22:11] (Merged) jenkins-bot: Construct wgLogos in CommonSettings so that projects can inherit values [mediawiki-config] - https://gerrit.wikimedia.org/r/583459 (owner: Jforrester) [22:24:14] Sorry about jenkins being slow - didn't realize the stack was so big [22:24:32] (PS2) CDanis: Prepped depool of esams (just in case) [dns] - https://gerrit.wikimedia.org/r/583760 [22:27:04] James_F: yes okay to deploy. Sorry for missing that [22:27:16] No worries. Lots of things going on. :-) [22:27:18] my irc notifications were muted o_o [22:27:24] they are back now [22:28:08] James_F I just joined - what is going on? Also, I've been watching on zuul, and the ChessBrowser patches don't take long individually; I think it was the number, not the stack. Either way, sorry [22:28:45] Operations, netops: mr1-esams i2c syslog flood - https://phabricator.wikimedia.org/T242097 (RobH) [22:28:50] DannyS712: It's the stack. You must not create stacks of more than 5 patches. The exponential CI load is the problem, not the time it takes for each patch to run through CI. [22:29:07] DannyS712: I'm seconds away from just dropping CI for the repo. [22:29:16] 5 patch limit? Ok, noted. Sorry [22:29:27] It's not reasonable that a toy repo deployed nowhere has taken most of the resources of CI for half an hour. [22:29:38] Understood [22:31:32] (5 is a rough limit, and mostly about the number running at once.) [22:32:47] Noted. Is there a place this can be documented?
Everything I've seen has mostly been trial and error since I was whitelisted; I couldn't find any docs about how CI works and what to not do (not that that is an excuse) [22:33:15] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Construct wgLogos in CommonSettings so that projects can inherit values (duration: 01m 02s) [22:33:18] It's documented on the CI documentation. [22:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:23] Somewhere. [22:33:40] The fact that gerrit won't let you even try to submit 10 patches is meant to be a hint. [22:34:02] That stack was created by hacking around the CI limits in the first place and took out CI for ~an hour when it was first pushed. [22:34:18] James_F let me know when you want me to test on mwdebug [22:34:55] Jdlrobson: Sorry, yeah, one moment, doing the underlying one first. [22:36:03] What do you mean "try to submit 10 patches"? https://gerrit.wikimedia.org/r/#/admin/projects/All-Projects,access says that "Trusted-Contributors" are allowed 20... maybe that should be reduced, just in case? [22:36:16] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Split wgLogos setting into wmgSiteLogo1x etc. (duration: 00m 59s) [22:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:21] *either way, not going to go over 5 again though [22:36:55] Thank you. [22:37:22] (CR) Jforrester: [C: +2] Preparation for removal of $wgMobileFrontendLogo [mediawiki-config] - https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: Jdlrobson) [22:37:27] Operations, Scap (Scap3-MediaWiki-MVP): Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352 (mmodell) I think this one should be resolved now?
[22:37:43] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:37:56] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 57s) [22:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:53] (Merged) jenkins-bot: Preparation for removal of $wgMobileFrontendLogo [mediawiki-config] - https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: Jdlrobson) [22:44:09] (PS1) Jforrester: Remove unset wordmark values [mediawiki-config] - https://gerrit.wikimedia.org/r/584724 [22:44:34] (CR) Jforrester: [C: +2] Remove unset wordmark values [mediawiki-config] - https://gerrit.wikimedia.org/r/584724 (owner: Jforrester) [22:44:54] Jdlrobson: Live on mwdebug1001, sorry. [22:45:36] (Merged) jenkins-bot: Remove unset wordmark values [mediawiki-config] - https://gerrit.wikimedia.org/r/584724 (owner: Jforrester) [22:45:42] looking [22:45:59] (LGTM but want your sign-off too.) [22:47:29] yep lgtm! [22:47:32] ^ James_F [22:47:34] Cool.
[22:50:07] !log jforrester@deploy1001 Synchronized wmf-config/mobile.php: Set wgMobileFrontendLogo from wgLogos['icon'] if set (duration: 00m 59s) [22:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:07] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Set wmgSiteLogoIcon for each project family and four special wikis (duration: 00m 58s) [22:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:28] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Provide wmgSiteLogoIcon (duration: 00m 57s) [22:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:06] Jdlrobson: All done. Thanks! [22:56:50] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 58s) [22:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:58] Operations, netops: cr3-knams:xe-0/1/3 down - https://phabricator.wikimedia.org/T244497 (RobH) [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200330T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:08:18] !log cdanis@cr3-knams# commit comment "sensible flow table sizes T248394" [23:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:24] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 [23:12:33] (PS4) Jforrester: Alphabetize GrowthExperiments settings [mediawiki-config] - https://gerrit.wikimedia.org/r/584132 (owner: Gergő Tisza) [23:16:22] (CR) Jforrester: [C: +2] Alphabetize GrowthExperiments settings [mediawiki-config] - https://gerrit.wikimedia.org/r/584132 (owner: Gergő Tisza) [23:16:41] !log cr2-esams: commit flex-flow-sizing T248394 [23:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:47] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 [23:17:54] (Merged) jenkins-bot: Alphabetize GrowthExperiments settings [mediawiki-config] - https://gerrit.wikimedia.org/r/584132 (owner: Gergő Tisza) [23:18:12] (PS4) Jforrester: Sync growthexperiments dblist with actual state of wmgUseGrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/584579 (https://phabricator.wikimedia.org/T248844) (owner: Gergő Tisza) [23:19:20] (PS1) Jeena Huneidi: Replace phpenmod with configmap when enabling xdebug [deployment-charts] - https://gerrit.wikimedia.org/r/584733 (https://phabricator.wikimedia.org/T246921) [23:19:48] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Alphabetize wikis in each GrowthExperiments settings (duration: 00m 58s) [23:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:50] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 58s) [23:20:53] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:30] (PS1) Jforrester: Drop fallback support for wgMobileFrontendLogo [mediawiki-config] - https://gerrit.wikimedia.org/r/584734 (https://phabricator.wikimedia.org/T248500) [23:30:11] !log cr3-esams: commit flex-flow-sizing T248394 [23:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:17] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 [23:32:31] (CR) Jforrester: [C: +1] Add export-11 [mediawiki-config] - https://gerrit.wikimedia.org/r/584095 (https://phabricator.wikimedia.org/T238921) (owner: Reedy) [23:35:35] (PS7) Mstyles: kibana: refactor kibana profile into two profiles [puppet] - https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) [23:37:15] (PS8) Mstyles: kibana: move httpd proxy authentication to a separate profile [puppet] - https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) [23:40:39] (CR) jerkins-bot: [V: -1] kibana: move httpd proxy authentication to a separate profile [puppet] - https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: Mstyles) [23:46:44] Operations, netops, Patch-For-Review: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (CDanis) [23:50:38] (PS9) Mstyles: kibana: move httpd proxy authentication to a separate profile [puppet] - https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961)