[00:00:04] Deploy window NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190705T0000)
[04:52:55] (PS2) Marostegui: Phabricator: Set taskmasters to 4 [puppet] - https://gerrit.wikimedia.org/r/520770 (https://phabricator.wikimedia.org/T227251) (owner: 20after4)
[04:53:25] (CR) Marostegui: [C: +2] Phabricator: Set taskmasters to 4 [puppet] - https://gerrit.wikimedia.org/r/520770 (https://phabricator.wikimedia.org/T227251) (owner: 20after4)
[05:01:27] (PS1) Marostegui: mariadb: Decommission db1069 [puppet] - https://gerrit.wikimedia.org/r/520828 (https://phabricator.wikimedia.org/T227166)
[05:02:27] !log Remove db1069 from tendril and zarcillo - T227166
[05:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:33] T227166: decommission db1069 - https://phabricator.wikimedia.org/T227166
[05:04:05] (CR) Marostegui: [C: +2] mariadb: Decommission db1069 [puppet] - https://gerrit.wikimedia.org/r/520828 (https://phabricator.wikimedia.org/T227166) (owner: Marostegui)
[05:08:19] !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.makevm
[05:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:58] !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.makevm
[05:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:42] !log Stop MySQL on db1069 for decommission T227166
[05:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:47] T227166: decommission db1069 - https://phabricator.wikimedia.org/T227166
[05:11:10] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[05:17:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[05:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:22] (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520830 (https://phabricator.wikimedia.org/T227062)
[05:18:39] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[05:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:39] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520830 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:21:43] (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520830 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:21:51] Operations, DBA, OTRS, Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (Marostegui) >>! In T226952#5295368, @Marostegui wrote: > Note: db2044 needs upgrading This was done
[05:22:00] (CR) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520830 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:22:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1104 for upgrade (duration: 00m 51s)
[05:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:23] !log Upgrade db1104 T227062
[05:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:28] T227062: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062
[05:35:58] Operations, Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (Marostegui) >>! In T227251#5306948, @mmodell wrote: > Now the graphs look better. Unfortunately, puppet will set the config back to 10 taskmasters unless we m...
[05:38:25] (PS1) Marostegui: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520833
[05:41:04] Operations, DBA, MediaWiki-Configuration, Patch-For-Review: Data model for dbconfig - https://phabricator.wikimedia.org/T197531 (Marostegui) @Joe @CDanis is this task still valid?
[05:41:24] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520833 (owner: Marostegui)
[05:42:17] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520833 (owner: Marostegui)
[05:42:39] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520833 (owner: Marostegui)
[05:42:49] (PS1) Vgutierrez: install_server: Add DHCP entries for ncredir[12]001 [puppet] - https://gerrit.wikimedia.org/r/520836 (https://phabricator.wikimedia.org/T133548)
[05:43:19] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1104 after upgrade (duration: 00m 49s)
[05:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:59] (CR) Vgutierrez: [C: +2] install_server: Add DHCP entries for ncredir[12]001 [puppet] - https://gerrit.wikimedia.org/r/520836 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[05:46:53] (PS1) Marostegui: db-eqiad.php: Fully repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520838
[05:54:10] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520838 (owner: Marostegui)
[05:55:00] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520838 (owner: Marostegui)
[05:56:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1104 after upgrade (duration: 00m 49s)
[05:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:24] (CR) jenkins-bot: db-eqiad.php: Fully repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520838 (owner: Marostegui)
[05:57:28] (CR) Vgutierrez: [C: +2] ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[05:57:39] (CR) Vgutierrez: [C: +2] acme_chief: Introduce the concept of shared certificates [puppet] - https://gerrit.wikimedia.org/r/517660 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[05:57:49] (PS7) Vgutierrez: acme_chief: Introduce the concept of shared certificates [puppet] - https://gerrit.wikimedia.org/r/517660 (https://phabricator.wikimedia.org/T133548)
[06:01:33] (PS25) Vgutierrez: ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548)
[06:09:44] (PS1) Vgutierrez: hieradata: Grant ncredir instances access to the ncredir certificates [puppet] - https://gerrit.wikimedia.org/r/520840 (https://phabricator.wikimedia.org/T133548)
[06:14:01] (PS1) Vgutierrez: site: Add ncredir[12]001 instances definition [puppet] - https://gerrit.wikimedia.org/r/520841 (https://phabricator.wikimedia.org/T133548)
[06:17:21] (CR) Jcrespo: "> Good point, I think that might have applied only to Prometheus 1. IMHO worth trying not force creation of empty files while we're at it " [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: Jcrespo)
[06:18:28] (CR) Vgutierrez: [C: +2] redirects.dat: Provide support for nginx in compile_redirects() [puppet] - https://gerrit.wikimedia.org/r/513279 (https://phabricator.wikimedia.org/T224539) (owner: Vgutierrez)
[06:18:38] (PS4) Vgutierrez: redirects.dat: Provide support for nginx in compile_redirects() [puppet] - https://gerrit.wikimedia.org/r/513279 (https://phabricator.wikimedia.org/T224539)
[06:19:14] (PS5) Jcrespo: mariadb: Prepare core for buster [puppet] - https://gerrit.wikimedia.org/r/519073 (https://phabricator.wikimedia.org/T193224)
[06:19:16] (PS14) Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896)
[06:22:01] Operations, Traffic: Provide nginx support in compile_redirects() - https://phabricator.wikimedia.org/T224539 (Vgutierrez) Open→Resolved
[06:22:08] Operations, Traffic, Goal, HTTPS, Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (Vgutierrez)
[06:24:45] (PS1) Marostegui: db1109: Convert it to candidate master [puppet] - https://gerrit.wikimedia.org/r/520842 (https://phabricator.wikimedia.org/T227062)
[06:25:21] (CR) Marostegui: [C: +2] db1109: Convert it to candidate master [puppet] - https://gerrit.wikimedia.org/r/520842 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[06:27:57] Operations, Analytics, hardware-requests, User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (elukey) >>! In T227288#5307228, @MoritzMuehlenhoff wrote: > Should these really be both in eqiad? The initial use case is for analytics, but we migh...
[06:30:31] (CR) Elukey: [C: +1] Update a number of comments still referring to Ubuntu [puppet] - https://gerrit.wikimedia.org/r/520764 (owner: Muehlenhoff)
[06:32:40] (PS1) Marostegui: db-codfw.php: Clean up old comments [mediawiki-config] - https://gerrit.wikimedia.org/r/520843
[06:32:44] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:33:51] (CR) Marostegui: [C: +2] db-codfw.php: Clean up old comments [mediawiki-config] - https://gerrit.wikimedia.org/r/520843 (owner: Marostegui)
[06:34:41] (Merged) jenkins-bot: db-codfw.php: Clean up old comments [mediawiki-config] - https://gerrit.wikimedia.org/r/520843 (owner: Marostegui)
[06:35:49] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove old comments (duration: 00m 50s)
[06:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:12] (PS3) Jcrespo: Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/520737
[06:36:32] (CR) jenkins-bot: db-codfw.php: Clean up old comments [mediawiki-config] - https://gerrit.wikimedia.org/r/520843 (owner: Marostegui)
[06:36:35] Operations, DBA, MediaWiki-Configuration, Patch-For-Review: Data model for dbconfig - https://phabricator.wikimedia.org/T197531 (Volans) Open→Resolved a:Volans The data model is now part of the software and will evolve with it, wikitech documentation will be provided for it. I'm resol...
[06:36:41] Operations, DBA, MediaWiki-Configuration, Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (Volans)
[06:38:18] (CR) Volans: [C: +2] Release 1.1.0 [software/conftool] - https://gerrit.wikimedia.org/r/519752 (owner: Volans)
[06:40:55] (Merged) jenkins-bot: Release 1.1.0 [software/conftool] - https://gerrit.wikimedia.org/r/519752 (owner: Volans)
[06:40:57] (PS4) Jcrespo: Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/520737
[06:42:27] (CR) Jcrespo: [C: +1] Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/520737 (owner: Jcrespo)
[06:43:28] (CR) Jcrespo: [C: +2] Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/520737 (owner: Jcrespo)
[06:43:50] (CR) Volans: [C: +2] debian: Release 1.1.0 [software/conftool] - https://gerrit.wikimedia.org/r/519753 (owner: Volans)
[06:44:25] (Merged) jenkins-bot: Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/520737 (owner: Jcrespo)
[06:46:08] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1109 with full weight (duration: 00m 49s)
[06:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:17] Operations, DBA, MediaWiki-Configuration, Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (Volans) All patches for v1 of dbconfig are merged, including the ones to make a new conftool releas...
[06:46:25] (CR) jenkins-bot: Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/520737 (owner: Jcrespo)
[06:46:27] (Merged) jenkins-bot: debian: Release 1.1.0 [software/conftool] - https://gerrit.wikimedia.org/r/519753 (owner: Volans)
[06:48:39] (PS1) Jcrespo: mariadb: Depool db1087 (s8 sanitarium master) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/520844
[06:52:06] Operations, observability, Performance-Team (Radar), User-Elukey: Consider adding per-shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (elukey) The PR is still waiting for the second upstream review, since there is no real rush I'd prefer to wait for t...
[06:58:52] Operations, Goal, User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197 (elukey)
[06:59:19] Operations, Analytics, ChangeProp, EventBus, and 4 others: Create custom per-job metric reporters capability - https://phabricator.wikimedia.org/T182274 (elukey)
[06:59:54] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:01:10] Operations, Analytics, EventBus, User-Elukey: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048 (elukey) Open→Declined Eventbus is on its road to decommission in favor of event-gate, I'd close this task since probably not relevant any...
[07:02:28] (CR) Marostegui: [C: +1] mariadb: Depool db1087 (s8 sanitarium master) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/520844 (owner: Jcrespo)
[07:05:20] (PS2) Muehlenhoff: prometheus-snmp-exporter: Switch to systemd::service [puppet] - https://gerrit.wikimedia.org/r/520773 (https://phabricator.wikimedia.org/T194724)
[07:08:53] (CR) Muehlenhoff: [C: +2] prometheus-snmp-exporter: Switch to systemd::service [puppet] - https://gerrit.wikimedia.org/r/520773 (https://phabricator.wikimedia.org/T194724) (owner: Muehlenhoff)
[07:10:46] (CR) Jcrespo: [C: +2] mariadb: Depool db1087 (s8 sanitarium master) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/520844 (owner: Jcrespo)
[07:11:01] (Merged) jenkins-bot: mariadb: Depool db1087 (s8 sanitarium master) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/520844 (owner: Jcrespo)
[07:11:03] (CR) jenkins-bot: mariadb: Depool db1087 (s8 sanitarium master) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/520844 (owner: Jcrespo)
[07:13:18] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1087 (duration: 00m 52s)
[07:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:23] (PS1) Muehlenhoff: Revert "prometheus-snmp-exporter: Switch to systemd::service" [puppet] - https://gerrit.wikimedia.org/r/520845
[07:16:13] (CR) jerkins-bot: [V: -1] Revert "prometheus-snmp-exporter: Switch to systemd::service" [puppet] - https://gerrit.wikimedia.org/r/520845 (owner: Muehlenhoff)
[07:16:16] PROBLEM - puppet last run on netmon1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[07:17:24] (CR) Muehlenhoff: [V: +2 C: +2] Revert "prometheus-snmp-exporter: Switch to systemd::service" [puppet] - https://gerrit.wikimedia.org/r/520845 (owner: Muehlenhoff)
[07:17:52] !log Compress small wikis on labsdb1009 T222978
[07:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:57] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978
[07:19:56] Operations, Analytics, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey) Tried to check in /var/log/apt/history the packages installed to make the Tensorflow and Thumbor (uses OpenCL) use case working: ` cxlactivitylogger hcc hsa-rocr-dev...
[07:21:42] RECOVERY - puppet last run on netmon1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:23:40] !log installing wireshark security updates on jessie
[07:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:48] (PS1) Jcrespo: Revert "mariadb: Depool db1087 (s8 sanitarium master) for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/520847
[07:33:38] Operations, Analytics, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey) Also there seems to be some movement in Debian for rocm: https://lists.debian.org/debian-devel/2019/06/msg00302.html
[07:35:25] !log installing imagemagick security updates on jessie
[07:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:12] (PS2) Jcrespo: Revert "mariadb: Depool db1087 (s8 sanitarium master) for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/520847
[07:49:50] (CR) Jcrespo: [C: +2] Revert "mariadb: Depool db1087 (s8 sanitarium master) for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/520847 (owner: Jcrespo)
[07:51:03] (Merged) jenkins-bot: Revert "mariadb: Depool db1087 (s8 sanitarium master) for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/520847 (owner: Jcrespo)
[07:51:05] (CR) jenkins-bot: Revert "mariadb: Depool db1087 (s8 sanitarium master) for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/520847 (owner: Jcrespo)
[07:57:07] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1087 (duration: 00m 48s)
[07:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:14] (PS1) Elukey: aptrepo: add component/amd-rocm [puppet] - https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723)
[08:33:55] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (fgiunchedi) After ~20h `ms-be2037` of running with "os control" set and `powersave` governor seems to behave fine. Compared to `performance` cpu load is slightly higher as expected and temperature slightl...
[08:42:16] (PS1) Jcrespo: Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850
[08:42:49] (CR) jerkins-bot: [V: -1] Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[08:46:32] (PS2) Jcrespo: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768
[08:46:34] (PS2) Jcrespo: Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850
[08:46:55] (CR) jerkins-bot: [V: -1] replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768 (owner: Jcrespo)
[08:46:59] (CR) jerkins-bot: [V: -1] Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[08:47:29] (PS3) Jcrespo: Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850
[08:49:38] (PS3) Jcrespo: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768
[08:50:01] (CR) jerkins-bot: [V: -1] replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768 (owner: Jcrespo)
[08:51:59] (PS1) Muehlenhoff: Add library hint for postgres [puppet] - https://gerrit.wikimedia.org/r/520852
[08:53:21] (CR) Muehlenhoff: [C: +2] Add library hint for postgres [puppet] - https://gerrit.wikimedia.org/r/520852 (owner: Muehlenhoff)
[08:53:45] a heads up, the VM running irc.wikimedia.org will be rebooted in about ten minutes for a security update (all clients have been automatically reconnecting in the past)
[08:54:05] (CR) Marostegui: [C: +1] "<3 thanks!" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[08:54:53] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime
[08:54:54] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:51] ACKNOWLEDGEMENT - Host elastic2054 is DOWN: PING CRITICAL - Packet loss = 100% Gehel tracked on https://phabricator.wikimedia.org/T227298
[09:00:47] (PS2) Gehel: cloudelastic: use the proper check for SSL certificates [puppet] - https://gerrit.wikimedia.org/r/520782
[09:01:13] !log rebooting kraz (irc.wikimedia.org) to pick up MDS-enabled qemu
[09:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:36] (CR) Gehel: [C: +2] cloudelastic: use the proper check for SSL certificates [puppet] - https://gerrit.wikimedia.org/r/520782 (owner: Gehel)
[09:06:51] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (fgiunchedi) Continuing the fleetwide audit, my impression is that unless explicitly set by puppet the governor should be `powersave`, thus the hosts that currently don't have that are: == Dell == `cumin...
[09:12:03] (CR) Jcrespo: "./switchover.py es2002 es2001 --read-only-master" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[09:12:36] (CR) Jcrespo: "Feel also free to criticize the wording for each one, as I have run of creativity." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[09:14:06] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (hashar) I am not sure whether it is related, but a month or so ago I have noticed that the old cloudvirt machines to have poor CPU performance for an unknown reason yet. We have made a benchmark on labte...
[09:15:03] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=codfw,name=elastic2054.codfw.wmnet
[09:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:24] (CR) Marostegui: [C: +1] "> Feel also free to criticize the wording for each one, as I have run" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[09:25:02] Operations: Support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T202255 (MoritzMuehlenhoff) Open→Declined We swapped the NICs in these servers to a model supported by 4.9 (Broadcom BCM57412) and for any new deployments we can use Buster which has a 4.19 k...
[09:26:16] Operations, Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (Gehel) elastic2054 is down again. It is set to pooled=inactive, and marked as failed in netbox. @Papaul: it looks like this is going to need your help. You can do whatever you need with this server and reboot i...
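The `conftool action` entry at 09:15:03 records a state change as `set/pooled=inactive` followed by a comma-separated `key=value` selector, which is how elastic2054 ended up `pooled=inactive` as mentioned on T227298. A small parser, assuming the entry text is exactly as printed in this log (the function name and the output dict layout are hypothetical), makes it easy to see which host and datacenter an action touched:

```python
import re
from typing import Dict, Optional

# Shape taken from the 09:15:03 entry:
# "conftool action : set/pooled=inactive; selector: dc=codfw,name=elastic2054.codfw.wmnet"
ACTION_RE = re.compile(
    r"conftool action : (?P<verb>\w+)/(?P<key>\w+)=(?P<value>[^;\s]+); "
    r"selector: (?P<selector>\S+)"
)


def parse_conftool_action(line: str) -> Optional[Dict[str, object]]:
    """Split a logged conftool action into verb, key, value and selector fields."""
    m = ACTION_RE.search(line)
    if m is None:
        return None
    # The selector is a comma-separated list of key=value pairs.
    selector = dict(part.split("=", 1) for part in m["selector"].split(","))
    return {"verb": m["verb"], "key": m["key"], "value": m["value"], "selector": selector}


if __name__ == "__main__":
    line = ("[09:15:03] !log gehel@puppetmaster1001 conftool action : "
            "set/pooled=inactive; selector: dc=codfw,name=elastic2054.codfw.wmnet")
    print(parse_conftool_action(line))
```

The sketch only covers a single `set/key=value` action with a space-free selector, which is all this log shows; other conftool output formats are not assumed.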
[09:28:31] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:28:32] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:05] !log rebooting LDAP replicas in eqiad
[09:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:26] (PS1) Ema: cache: reimage cp1086 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520860 (https://phabricator.wikimedia.org/T226638)
[09:39:57] !log depool cp1086 and reimage as upload_ats T226638
[09:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:01] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638
[09:41:14] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (fgiunchedi) In terms of scaling drivers, here's the list of hosts that don't have `intel_pstate` (which AIUI is what we want to use) `cumin -p99 -b100 'F:virtual ~ physical' 'cat /sys/devices/system/cpu/...
[09:41:30] (CR) Ema: [C: +1] "Seems fine and pcc agrees https://puppet-compiler.wmflabs.org/compiler1001/17238/" [puppet] - https://gerrit.wikimedia.org/r/520774 (owner: Muehlenhoff)
[09:42:00] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (fgiunchedi) >>! In T225713#5307833, @hashar wrote: > I am not sure whether it is related, but a month or so ago I have noticed that the old cloudvirt machines to have poor CPU performance for an unknown r...
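The CPU scaling governor audit (T225713) compares each host's cpufreq governor (`powersave` vs `performance`) and scaling driver (`intel_pstate`). The fleet-wide run goes through cumin and the exact command in the log is truncated, so it is not reproduced here. As a per-host sketch only, assuming the standard Linux cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor` and `scaling_driver`), and taking the expected values from the comments above as defaults rather than as confirmed policy:

```python
from collections import Counter
from pathlib import Path
from typing import Dict, List


def cpufreq_summary(root: str = "/sys/devices/system/cpu") -> Dict[str, Dict[str, int]]:
    """Count the scaling governors and drivers reported by every CPU under `root`."""
    governors: Counter = Counter()
    drivers: Counter = Counter()
    # cpu0/cpufreq, cpu1/cpufreq, ...; entries such as cpuidle fail the digit match.
    for cpufreq in sorted(Path(root).glob("cpu[0-9]*/cpufreq")):
        governor = cpufreq / "scaling_governor"
        driver = cpufreq / "scaling_driver"
        if governor.is_file():
            governors[governor.read_text().strip()] += 1
        if driver.is_file():
            drivers[driver.read_text().strip()] += 1
    return {"governors": dict(governors), "drivers": dict(drivers)}


def needs_attention(summary: Dict[str, Dict[str, int]],
                    expected_governor: str = "powersave",
                    expected_driver: str = "intel_pstate") -> List[str]:
    """List deviations from the expected governor/driver (defaults are assumptions)."""
    problems = []
    if not summary["governors"]:
        problems.append("no cpufreq data found")
    odd_governors = sorted(g for g in summary["governors"] if g != expected_governor)
    if odd_governors:
        problems.append("unexpected governor(s): " + ", ".join(odd_governors))
    odd_drivers = sorted(d for d in summary["drivers"] if d != expected_driver)
    if odd_drivers:
        problems.append("unexpected driver(s): " + ", ".join(odd_drivers))
    return problems


if __name__ == "__main__":
    summary = cpufreq_summary()
    print(summary)
    for problem in needs_attention(summary):
        print("WARNING:", problem)
```

Hosts where puppet deliberately sets `performance` would need to pass a different `expected_governor`. The sketch has no knowledge of which hosts those are.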
[09:42:38] (CR) Ema: [C: +2] cache: reimage cp1086 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520860 (https://phabricator.wikimedia.org/T226638) (owner: Ema)
[09:45:28] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1086.eqiad.wmnet'] ` The log can be found in `...
[09:47:11] Operations, MediaWiki-Cache, Performance-Team (Radar), User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (elukey) The two remaining use cases are: * labswiki * thumbor The latter should be doable, but the former seems a bit more complicated...
[09:48:41] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:48:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:15] !log rebooting serpens to pick up MDS-enabled qemu
[09:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:42] Operations, observability, serviceops, Performance-Team (Radar), User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (elukey) @fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet, do you have any sugg...
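elukey's comment on T224454 notes that `node_network_transmit_bytes_total` is already used for swift alerting. That metric is a monotonically increasing byte counter, so a bandwidth alert has to work on its per-second rate rather than on the raw value. The sketch below is a simplified two-sample version of that idea (Prometheus's own `rate()` additionally extrapolates over the range window), with a threshold expressed as a fraction of link speed; the 80% default and both function names are hypothetical, not values from the task:

```python
def counter_rate(prev_value: float, prev_ts: float, cur_value: float, cur_ts: float) -> float:
    """Per-second increase of a monotonic counter between two samples.

    A decrease is treated as a counter reset (e.g. after a reboot), in which
    case the counter is assumed to have restarted from zero.
    """
    if cur_ts <= prev_ts:
        raise ValueError("samples must be in increasing time order")
    delta = cur_value - prev_value
    if delta < 0:
        delta = cur_value
    return delta / (cur_ts - prev_ts)


def exceeds_link_fraction(bytes_per_s: float, link_bits_per_s: float, fraction: float = 0.8) -> bool:
    """True when transmit bandwidth is above `fraction` of the link capacity."""
    # The counter is in bytes, link speeds are usually quoted in bits per second.
    return bytes_per_s * 8 > link_bits_per_s * fraction


if __name__ == "__main__":
    # Two hypothetical samples of node_network_transmit_bytes_total, 60 s apart,
    # checked against a hypothetical 10 Gbit/s link.
    rate = counter_rate(1_000_000_000, 0.0, 67_000_000_000, 60.0)
    print(rate, exceeds_link_fraction(rate, 10e9))
```

Converting bytes to bits before comparing is the detail most easily missed when the counter and the link speed use different units.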
[09:52:44] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Qgil) a:03Qgil [10:00:29] !log rebooting seaborgium to pick up MDS-enabled qemu [10:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:52] !log Rolling rebood rdb* hosts - T227304 [10:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:57] T227304: Reboot rdb* cluster - https://phabricator.wikimedia.org/T227304 [10:06:23] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [10:06:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:01] (03CR) 10Vgutierrez: [C: 03+2] hieradata: Grant ncredir instances access to the ncredir certificates [puppet] - 10https://gerrit.wikimedia.org/r/520840 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [10:09:10] (03PS2) 10Vgutierrez: hieradata: Grant ncredir instances access to the ncredir certificates [puppet] - 10https://gerrit.wikimedia.org/r/520840 (https://phabricator.wikimedia.org/T133548) [10:14:08] (03CR) 10Jbond: [V: 03+2 C: 03+2] "plus 2" [labs/private] - 10https://gerrit.wikimedia.org/r/520776 (owner: 10Jbond) [10:14:37] !log fixed up kernel packages on serpens/seaborgium, these were dist-upgraded from jessie, but the correct kernel packages for Stretch were not setup, as such there were still stuck with an old jessie kernel [10:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:34] !log rebooting serpens to pick up correct Stretch kernel [10:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:25] (03PS2) 10Vgutierrez: site: Add ncredir[12]001 instances definition [puppet] - 
10https://gerrit.wikimedia.org/r/520841 (https://phabricator.wikimedia.org/T133548) [10:18:26] (03PS1) 10Vgutierrez: install_server: Add disk layout for ncredir[12]001 instances [puppet] - 10https://gerrit.wikimedia.org/r/520865 (https://phabricator.wikimedia.org/T133548) [10:19:37] (03CR) 10Vgutierrez: [C: 03+2] install_server: Add disk layout for ncredir[12]001 instances [puppet] - 10https://gerrit.wikimedia.org/r/520865 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [10:23:10] (03CR) 10Vgutierrez: [C: 03+2] site: Add ncredir[12]001 instances definition [puppet] - 10https://gerrit.wikimedia.org/r/520841 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [10:23:12] !log rebooting seaborgium to pick up correct Stretch kernel [10:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:39] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [10:23:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:22] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:29:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:45] !log rebooting debug proxies to pick up MDS-enabled qemu [10:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:02] PROBLEM - docker-registry LVS codfw on docker-registry.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 354 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Docker-registry-runbook [10:31:13] PROBLEM - LVS HTTP IPv4 on docker-registry.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 503 Service Unavailable - 354 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:31:15] ^ expected [10:31:18] PROBLEM - Docker registry health on registry2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 235 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [10:31:31] ^ expected [10:31:34] PROBLEM - Docker registry HTTPS interface on registry2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2002.codfw.wmnet:443/v2/wikimedia-stretch/manifests/latest - 354 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Docker [10:31:44] PROBLEM - Docker registry HTTPS interface on registry2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2001.codfw.wmnet:443/v2/wikimedia-stretch/manifests/latest - 354 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Docker [10:31:50] PROBLEM - Docker registry health on registry2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 235 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [10:32:00] kk, thanks jijiki [10:32:21] k (that was a page) [10:32:30] RECOVERY - docker-registry LVS codfw on docker-registry.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 292 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Docker-registry-runbook [10:32:41] RECOVERY - LVS HTTP IPv4 on docker-registry.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 292 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:32:46] RECOVERY - Docker registry health on registry2002 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [10:33:02] RECOVERY - Docker registry HTTPS 
interface on registry2002 is OK: HTTP OK: HTTP/1.1 200 OK - 2545 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Docker [10:33:10] RECOVERY - Docker registry HTTPS interface on registry2001 is OK: HTTP OK: HTTP/1.1 200 OK - 2545 bytes in 0.260 second response time https://wikitech.wikimedia.org/wiki/Docker [10:33:18] RECOVERY - Docker registry health on registry2001 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [10:33:28] indeed, the rdb hosts reboot triggered registry failure ? [10:34:03] yes [10:34:07] from logs [10:34:11] https://www.irccloud.com/pastebin/0PLYj8sc/ [10:34:13] Operations, Developer-Advocacy, Discourse, Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (Qgil) [10:34:21] that is unexpected i'll fill a task [10:36:39] ok then it was expected from my POV:p [10:36:55] ack, thanks [10:52:40] Operations, Traffic: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1086.eqiad.wmnet'] ` and were **ALL** successful.
[10:55:15] !log pool cp1086 w/ ATS backend T226638 [10:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:21] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [10:59:10] (PS1) Ema: cache: reimage cp1088 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520867 (https://phabricator.wikimedia.org/T226638) [11:00:11] !log depool cp1088 and reimage as upload_ats T226638 [11:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:19] (CR) Ema: [C: +2] cache: reimage cp1088 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520867 (https://phabricator.wikimedia.org/T226638) (owner: Ema) [11:02:47] Operations, observability, serviceops, Performance-Team (Radar), User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (fgiunchedi) >>! In T224454#5307968, @elukey wrote: > @fgiunchedi I noticed that node_network_transmit_bytes_total is already u... [11:04:11] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1088.eqiad.wmnet'] ` The log can be found in `...
[11:04:57] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [11:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:09] !log jmm@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [11:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:47] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [11:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:56] !log jmm@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [11:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:51] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [11:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:01] !log jmm@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [11:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:01] Operations, hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (fgiunchedi) [11:19:13] Operations: creation of prometheus_puppet_agent_stats fails on first puppet run - https://phabricator.wikimedia.org/T227315 (Vgutierrez) [11:21:56] (PS1) Vgutierrez: prometheus: Fix prometheus_puppet_agent_stats dependencies [puppet] - https://gerrit.wikimedia.org/r/520869 (https://phabricator.wikimedia.org/T227315) [11:26:32] (CR) Filippo Giunchedi: [C: +1] prometheus: Fix prometheus_puppet_agent_stats dependencies [puppet] - https://gerrit.wikimedia.org/r/520869 (https://phabricator.wikimedia.org/T227315) (owner: Vgutierrez) [11:28:23] (CR) Vgutierrez: [C: +2] prometheus: Fix prometheus_puppet_agent_stats dependencies [puppet] - https://gerrit.wikimedia.org/r/520869 (https://phabricator.wikimedia.org/T227315) (owner: Vgutierrez) [11:28:36] (PS2) Vgutierrez: prometheus: Fix prometheus_puppet_agent_stats
dependencies [puppet] - https://gerrit.wikimedia.org/r/520869 (https://phabricator.wikimedia.org/T227315) [11:31:02] !log installing postgresql-9.4 updates on jessie [11:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:59] Operations: creation of prometheus_puppet_agent_stats fails on first puppet run - https://phabricator.wikimedia.org/T227315 (Vgutierrez) Open→Resolved p:Triage→Normal a:Vgutierrez [11:32:08] that was fast.. [11:32:17] !log Upgrading smartarray firmware on ms-be1021 - T141756 - T227076 [11:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:24] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [11:32:24] T227076: Upgrade firmware on ms-be1021 (Was: Degraded RAID on ms-be1021) - https://phabricator.wikimedia.org/T227076 [11:33:13] vgutierrez: \o/ [11:33:16] thank you [11:33:19] np :D [11:35:03] (PS1) Ema: cache: reimage cp1090 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520870 (https://phabricator.wikimedia.org/T226638) [11:38:26] !log Reboot ms-be1021 - T141756 - T227076 [11:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:32] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [11:38:32] T227076: Upgrade firmware on ms-be1021 (Was: Degraded RAID on ms-be1021) - https://phabricator.wikimedia.org/T227076 [11:39:40] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [11:39:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:14] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638
(ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1088.eqiad.wmnet'] ` and were **ALL** successful. [11:46:56] !log pool cp1088 w/ ATS backend T226638 [11:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:01] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [11:47:26] Operations, ops-eqiad, serviceops: Upgrade firmware on ms-be1021 (Was: Degraded RAID on ms-be1021) - https://phabricator.wikimedia.org/T227076 (jijiki) Open→Resolved a:jijiki There are still messages like ` [ 122.753602] perf: interrupt took too long (2953 > 2500), lowering kernel.per... [11:56:31] (PS1) Ema: cache_upload: remove varnish from frontend::backend_services [puppet] - https://gerrit.wikimedia.org/r/520872 (https://phabricator.wikimedia.org/T226589) [12:02:28] Operations, Analytics, hardware-requests, User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (MoritzMuehlenhoff) >>! In T227288#5307686, @elukey wrote: > This is a very good point. Would we have only one KDC per datacenter? I think having o... [12:02:40] Operations, Analytics, hardware-requests, User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (MoritzMuehlenhoff) p:Triage→Normal [12:02:48] Operations, Performance-Team, Traffic, Patch-For-Review, Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (ema) @Gilles: is there anything left to be done here? Other than blogging about the results that is.
:-) [12:03:56] Operations, hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (MoritzMuehlenhoff) p:Triage→Normal [12:05:18] Operations, Analytics, Traffic: Increased number of webrequest sequence-numbers alarms (mostly) on upload webrequest-source - https://phabricator.wikimedia.org/T225786 (ema) [12:06:05] (PS4) Jcrespo: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768 [12:06:18] (CR) jerkins-bot: [V: -1] replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768 (owner: Jcrespo) [12:09:11] (PS1) Ema: ATS: do not add Server: header [puppet] - https://gerrit.wikimedia.org/r/520875 (https://phabricator.wikimedia.org/T224119) [12:09:16] (PS5) Jcrespo: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768 [12:12:03] !log depool cp1090 and reimage as upload_ats T226638 [12:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:08] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [12:13:08] (CR) Ema: [C: +2] cache: reimage cp1090 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520870 (https://phabricator.wikimedia.org/T226638) (owner: Ema) [12:14:38] (Abandoned) Hashar: cassandra: fix spec service provider [puppet] - https://gerrit.wikimedia.org/r/503996 (owner: Hashar) [12:15:17] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1090.eqiad.wmnet'] ` The log can be found in `...
[12:18:28] (PS2) Jbond: Remove support for Ubuntu from os_version and related tests [puppet] - https://gerrit.wikimedia.org/r/520765 (owner: Muehlenhoff) [12:23:31] Operations, serviceops: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (hashar) [12:24:16] Operations, serviceops: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (hashar) Seems `krypton.eqiad.wmnet` is still using Jessie / php5.6. We could use an upgrade to Stretch to drop php5.6 support from the CI infrastructure :-] [12:24:46] (CR) Muehlenhoff: [C: -1] aptrepo: add component/amd-rocm (2 comments) [puppet] - https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) (owner: Elukey) [12:31:14] (CR) Muehlenhoff: [C: +1] "Looks fine, besides the fleet-wide default ports like SSH, the only thing externally reachable in need of a ferm rule is rsyncd, but stati" [puppet] - https://gerrit.wikimedia.org/r/520706 (https://phabricator.wikimedia.org/T170826) (owner: Elukey) [12:31:46] Operations, Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (MoritzMuehlenhoff) p:High→Normal [12:39:26] Operations: HP Gen9 onboard controller review - https://phabricator.wikimedia.org/T216175 (MoritzMuehlenhoff) I saw this task during clinic duty and I'm wondering what/if there's anything left to be done? S100i SR SW RAID seems to be about some HP software offering for Windows to run a software RAID, but we... [12:41:55] Operations, netbox: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634 (MoritzMuehlenhoff) Is rebooted the Netbox hosts (1002, 2001) for the MDS kernel issues this week and that does not seem to be an issue any more. Can this bug be closed or is there anyth...
[12:43:39] Operations, Traffic, Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (MoritzMuehlenhoff) @Vgutierrez The firmware update on the NICs fixed this for good, right? Can we close this task? [12:46:48] Operations, Patch-For-Review: logrotate for visualdiff tests on Parsoid test host (scandium) - https://phabricator.wikimedia.org/T161920 (MoritzMuehlenhoff) [12:47:00] Operations, vm-requests: Site: eqiad/codfw 2 VMs each for pool counters - https://phabricator.wikimedia.org/T226811 (MoritzMuehlenhoff) a:MoritzMuehlenhoff [12:50:29] Operations, Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (Marostegui) Open→Resolved a:mmodell Just to clarify, we have lowered the priority because the slaves are no longer lagging. A few minutes ago the... [12:52:21] (CR) Elukey: aptrepo: add component/amd-rocm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) (owner: Elukey) [12:53:11] Operations, Wikimedia-Mailing-lists: LGBT mailing list moderator password reset - https://phabricator.wikimedia.org/T225787 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff I'm marking this as resolved, please reopen if anything else needs to be done. [12:53:32] (PS2) Elukey: aptrepo: add component/amd-rocm [puppet] - https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) [12:56:29] Operations, ops-codfw: ms-be2018 sdc unreadable sector - https://phabricator.wikimedia.org/T225630 (fgiunchedi) a:Papaul @Papaul please order / replace this disk when you get a chance!
[12:58:30] (CR) Muehlenhoff: [C: +1] aptrepo: add component/amd-rocm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) (owner: Elukey) [12:59:14] Operations, Analytics, hardware-requests, User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (elukey) Makes sense, the extra latency to codfw shouldn't be a big deal. I know that we need to have only one kadmin server, but I was thinking abou... [13:01:16] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1090.eqiad.wmnet'] ` and were **ALL** successful. [13:03:06] Operations, observability, serviceops, Performance-Team (Radar), User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (elukey) >>! In T224454#5308149, @fgiunchedi wrote: >>>! In T224454#5307968, @elukey wrote: >> @fgiunchedi I noticed that node_... [13:04:16] (CR) Elukey: aptrepo: add component/amd-rocm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) (owner: Elukey) [13:05:01] (PS3) Elukey: aptrepo: add component/amd-rocm [puppet] - https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) [13:05:25] !log pool cp1090 w/ ATS backend T226638 [13:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:31] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [13:06:09] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (ema) Open→Resolved a:ema With the conversion of cp1090 this is now done.
[13:06:12] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes - https://phabricator.wikimedia.org/T226589 (ema) [13:06:36] (Abandoned) Ema: Revert "Normalize thumbnail URLs to avoid cachebusting" [puppet] - https://gerrit.wikimedia.org/r/518231 (owner: Ema) [13:06:55] (Abandoned) Ema: package_builder: move lintian out of require_package [puppet] - https://gerrit.wikimedia.org/r/506679 (owner: Ema) [13:07:12] (Abandoned) Ema: ATS: log cache results and backend URL [puppet] - https://gerrit.wikimedia.org/r/477245 (owner: Ema) [13:10:13] (PS7) Fsero: adding a buster docker base image [puppet] - https://gerrit.wikimedia.org/r/520503 [13:11:45] (CR) Fsero: [C: +2] adding a buster docker base image [puppet] - https://gerrit.wikimedia.org/r/520503 (owner: Fsero) [13:20:30] Operations, ops-codfw: lvs2002 possible broken BBU - https://phabricator.wikimedia.org/T223949 (MoritzMuehlenhoff) a:Papaul [13:22:49] Operations, ops-eqiad, Traffic: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (ema) Open→Resolved a:ema The host has been in production for weeks without issues now. Closing.
[13:23:27] Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (MoritzMuehlenhoff) a:MoritzMuehlenhoff [13:23:45] Operations, cloud-services-team: Investigate use of hp-asrd on HPE servers - https://phabricator.wikimedia.org/T221939 (MoritzMuehlenhoff) a:MoritzMuehlenhoff [13:25:30] (PS3) Fsero: registry: improving swift replication [puppet] - https://gerrit.wikimedia.org/r/519018 [13:25:39] (CR) Fsero: [C: +2] registry: improving swift replication [puppet] - https://gerrit.wikimedia.org/r/519018 (owner: Fsero) [13:26:26] (PS2) Ema: cache_upload: remove varnish from frontend::backend_services [puppet] - https://gerrit.wikimedia.org/r/520872 (https://phabricator.wikimedia.org/T226589) [13:26:57] !log restarting swift-container-sync on swift backends [13:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:13] (CR) Ema: [C: +2] cache_upload: remove varnish from frontend::backend_services [puppet] - https://gerrit.wikimedia.org/r/520872 (https://phabricator.wikimedia.org/T226589) (owner: Ema) [13:28:03] ema safe to merge? [13:28:22] if no you can merge mine at your convenience [13:28:22] fsero: yes, please go ahead [13:28:35] done ty [13:29:57] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:29:58] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:29:58] PROBLEM - puppet last run on ms-be1038 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:01] PROBLEM - puppet last run on ms-be1046 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet.
Could be an interrupted request or a dependency cycle. [13:30:07] PROBLEM - puppet last run on ms-be2050 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:13] PROBLEM - puppet last run on ms-be1048 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:15] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:22] er/ [13:30:23] PROBLEM - puppet last run on ms-be1045 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:24] err [13:30:35] PROBLEM - puppet last run on ms-be1044 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:42] what are those ms-be.* servers for? [13:30:43] PROBLEM - puppet last run on ms-be2044 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:45] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:45] PROBLEM - puppet last run on ms-be2047 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:45] PROBLEM - puppet last run on ms-be1029 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:47] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
Could be an interrupted request or a dependency cycle. [13:30:54] godog: ^ [13:30:55] hauskatze: swift back end [13:30:57] PROBLEM - puppet last run on ms-be1041 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:31:01] PROBLEM - puppet last run on ms-be2033 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:31:06] apergos: ack, thanks for explaining :) [13:31:13] (PS1) Milimetric: Update Mediawiki Reduced snapshot for AQS [puppet] - https://gerrit.wikimedia.org/r/520884 [13:31:23] PROBLEM - puppet last run on ms-be1050 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:31:26] looking into that [13:31:31] PROBLEM - puppet last run on ms-be1036 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:31:36] fsero: any chance this is you? [13:31:38] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:31:40] (CR) Muehlenhoff: [C: -1] aptrepo: add component/amd-rocm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) (owner: Elukey) [13:31:47] PROBLEM - puppet last run on ms-be2040 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:31:48] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[13:31:49] PROBLEM - puppet last run on ms-be2030 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:15] PROBLEM - puppet last run on ms-be2029 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:21] PROBLEM - puppet last run on ms-be2038 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:23] PROBLEM - puppet last run on ms-be1024 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:29] PROBLEM - puppet last run on ms-be2037 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:41] PROBLEM - puppet last run on ms-be2046 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:47] PROBLEM - puppet last run on ms-be2031 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:47] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:57] PROBLEM - puppet last run on ms-be2036 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:15] PROBLEM - puppet last run on ms-be2048 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[13:33:17] PROBLEM - puppet last run on ms-be1031 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:18] apergos: maybe [13:33:21] lookint into it [13:33:23] PROBLEM - puppet last run on ms-be1042 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:27] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:28] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:28] PROBLEM - puppet last run on ms-be2026 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:31] ty [13:33:41] PROBLEM - puppet last run on ms-be1040 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:51] PROBLEM - puppet last run on ms-be1049 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:55] !log disabling puppet on swift backends [13:33:57] PROBLEM - puppet last run on ms-be2039 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:58] PROBLEM - puppet last run on ms-be2041 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[13:33:58] hmmh, second puppet run on ms-be1026 worked fine for me [13:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:59] PROBLEM - puppet last run on ms-be2024 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:34:15] (PS2) Elukey: role::aqs: update druid config [puppet] - https://gerrit.wikimedia.org/r/520884 (owner: Milimetric) [13:34:21] PROBLEM - puppet last run on ms-be2022 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:34:21] PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:34:25] PROBLEM - puppet last run on ms-be1037 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:34:25] PROBLEM - puppet last run on ms-be2020 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:34:31] PROBLEM - puppet last run on ms-be2043 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:34:35] PROBLEM - puppet last run on ms-be2027 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:34:38] PROBLEM - puppet last run on ms-be2045 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:34:38] PROBLEM - puppet last run on ms-be2049 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[13:34:38] PROBLEM - puppet last run on ms-be2042 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:34:41] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:34:46] (CR) Elukey: [C: +2] role::aqs: update druid config [puppet] - https://gerrit.wikimedia.org/r/520884 (owner: Milimetric) [13:35:21] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:35:21] PROBLEM - puppet last run on ms-be1039 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:35:21] PROBLEM - puppet last run on ms-be2028 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:36:01] likewise on a second host I tried (be-2039) [13:36:45] moritzm: yeah it also worked for me on other hosts [13:36:54] i think maybe puppetmaster got overloaded [13:37:12] (PS4) Elukey: aptrepo: add component/amd-rocm [puppet] - https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) [13:37:13] anyhow i disabled puppet on all swift backends i will reenable them and run puppet on batches [13:37:27] sorry for the spam [13:37:28] were they all queued up to run at once? [13:37:38] how many boxes is that, roughly? [13:37:43] anything I can do to help fsero ?
[13:39:07] apergos: more than 50 i think [13:39:39] godog: let me find out whats happening and i'll let you know, nothing seems broken on swift so we are good [13:40:02] ack [13:43:24] Operations, Traffic: Rename role::cache::upload_ats to role::cache::upload - https://phabricator.wikimedia.org/T227328 (ema) [13:43:47] Operations, Traffic: Rename role::cache::upload_ats to role::cache::upload - https://phabricator.wikimedia.org/T227328 (ema) p:Triage→Normal [13:44:03] !log roll restart of aqs on aqs100* to pick up new druid settings [13:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:15] RECOVERY - puppet last run on ms-be2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:47:25] (PS1) Jbond: facter - cpu_details: add governor fact [puppet] - https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) [13:48:19] (PS2) Jbond: facter - cpu_details: add governor fact [puppet] - https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) [13:52:43] godog: im about to do this 'sudo -i cumin -m sync -b 3 'A:swift-be' 'bash -c "puppet agent --enable; run-puppet-agent"' [13:52:43] 70 hosts will be targeted: [13:52:43] ms-be[2016-2050].codfw.wmnet,ms-be[1016-1050].eqiad.wmnet' [13:52:52] anything to object? [13:53:12] should run on batches of 3 but im thinking on adding an extra sleep just inc ase [13:54:08] (CR) Filippo Giunchedi: facter - cpu_details: add governor fact (2 comments) [puppet] - https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) (owner: Jbond) [13:55:01] fsero: did it work on a single host already ? but yeah lgtm [13:55:07] yeah it worked [13:55:41] RECOVERY - puppet last run on ms-be2024 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:55:55] fsero: ack, then I guess the question is why did it fail ?
[13:56:18] puppetmaster got overloaded? i have a question back for you, do we have metrics on prometheus for puppetmaster? [13:56:24] guess so so ill check them [13:57:18] we've got a dashboard for puppetdb, not sure about puppet master [13:57:33] (PS1) Muehlenhoff: Add DNS entries for poolcounter100[45] and poolcounter200[34] [dns] - https://gerrit.wikimedia.org/r/520887 (https://phabricator.wikimedia.org/T226811) [13:58:38] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) (owner: Elukey) [14:00:43] (CR) Alexandros Kosiaris: [C: +1] Add DNS entries for poolcounter100[45] and poolcounter200[34] [dns] - https://gerrit.wikimedia.org/r/520887 (https://phabricator.wikimedia.org/T226811) (owner: Muehlenhoff) [14:01:01] RECOVERY - puppet last run on ms-be1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:01:31] RECOVERY - puppet last run on ms-be2022 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:01:47] RECOVERY - puppet last run on ms-be2042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:02:31] RECOVERY - puppet last run on ms-be2050 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:02:59] RECOVERY - puppet last run on ms-be1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:03:09] RECOVERY - puppet last run on ms-be2044 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [14:04:45] RECOVERY - puppet last run on ms-be2029 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [14:05:47] RECOVERY - puppet last run on ms-be1031 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:06:13] RECOVERY - puppet last run on ms-be1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:07:08] RECOVERY - puppet last run on ms-be2027 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:07:47] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:08:55] RECOVERY - puppet last run on ms-be2033 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:10:52] (CR) Muehlenhoff: [C: +2] Add DNS entries for poolcounter100[45] and poolcounter200[34] [dns] - https://gerrit.wikimedia.org/r/520887 (https://phabricator.wikimedia.org/T226811) (owner: Muehlenhoff) [14:12:25] RECOVERY - puppet last run on ms-be2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:39] RECOVERY - puppet last run on ms-be2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:13:21] RECOVERY - puppet last run on ms-be2028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:14:01] RECOVERY - puppet last run on ms-be1028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:15:01] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:15:14] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [14:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:21] RECOVERY - puppet last run on ms-be2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:51] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:04] (PS3) Jbond: facter - cpu_details: add governor and scaling_driver facts [puppet] - https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) [14:18:05] RECOVERY - puppet last run on ms-be2049 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:18:42] (CR)
10jerkins-bot: [V: 04-1] facter - cpu_details: add governor and scaling_driver facts [puppet] - 10https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) (owner: 10Jbond) [14:18:43] RECOVERY - puppet last run on ms-be1039 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:21:13] RECOVERY - puppet last run on ms-be2038 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:21:13] RECOVERY - puppet last run on ms-be1024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:22:17] RECOVERY - puppet last run on ms-be2026 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [14:22:18] (03PS4) 10Jbond: facter - cpu_details: add governor and scaling_driver facts [puppet] - 10https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) [14:22:34] (03CR) 10Jbond: facter - cpu_details: add governor and scaling_driver facts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) (owner: 10Jbond) [14:23:25] RECOVERY - puppet last run on ms-be2043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:23:35] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:24:03] RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:24:03] RECOVERY - puppet last run on ms-be1038 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:24:11] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [14:24:49] RECOVERY - puppet last run on ms-be2047 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:24:54] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [14:24:58] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:45] RECOVERY - puppet last run on ms-be2037 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:27:01] RECOVERY - puppet last run on ms-be2031 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:29:45] RECOVERY - puppet last run on ms-be1048 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:30:21] RECOVERY - puppet last run on ms-be1026 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:30:31] RECOVERY - puppet last run on ms-be1041 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:31:35] RECOVERY - puppet last run on ms-be1025 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:31:37] RECOVERY - puppet last run on ms-be2030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:32:23] RECOVERY - puppet last run on ms-be2046 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:32:57] RECOVERY - puppet last run on ms-be2048 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:33:05] RECOVERY - puppet last run on ms-be1042 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:33:43] RECOVERY - puppet last run on ms-be2041 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:33:43] RECOVERY - puppet last run on ms-be2039 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:34:09] RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:35:11] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [14:35:20] (03CR) 10Filippo Giunchedi: facter - cpu_details: add governor and scaling_driver facts (032 
comments) [puppet] - 10https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) (owner: 10Jbond) [14:35:21] RECOVERY - puppet last run on ms-be1045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:37:51] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:40:10] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10MoritzMuehlenhoff) I used the makevm cook book to create a pool counter VM and it worked great for me! One thing I'd sug... [14:40:25] RECOVERY - puppet last run on ms-be1046 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [14:41:07] RECOVERY - puppet last run on ms-be1029 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [14:41:53] RECOVERY - puppet last run on ms-be1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:42:05] RECOVERY - puppet last run on ms-be1036 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:42:22] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10elukey) >>! In T203963#5308744, @MoritzMuehlenhoff wrote: > I used the makevm cook book to create a pool counter VM and... 
[14:42:25] RECOVERY - puppet last run on ms-be2040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:45:05] RECOVERY - puppet last run on ms-be1037 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:49:01] (03PS5) 10Jbond: facter - cpu_details: add governor and scaling_driver facts [puppet] - 10https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) [14:50:14] (03CR) 10Jbond: "updated, thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) (owner: 10Jbond) [14:51:23] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [14:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:15] 10Operations, 10ops-codfw: lvs2002 possible broken BBU - https://phabricator.wikimedia.org/T223949 (10Papaul) p:05Normal→03Low [14:56:00] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10MoritzMuehlenhoff) Ah, and one more thing: After typing "done" for confirmation, output stalls for about ten minutes whi... [14:58:38] fsero: FYI run-puppet-agent has a --enable option ;) [14:58:50] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10MoritzMuehlenhoff) And one more thought/idea: Our reimage script requires to be run in screen/tmux, maybe that's a good... 
[15:01:03] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:31] 10Operations, 10netbox: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634 (10Volans) Most things were indeed fixed, I'm not sure on the status of the last 2 in the description checkboxes list. But they shouldn't affect anymore reboots/restarts but at most new in... [15:04:13] 10Operations, 10netbox: Error in postgres puppettization for new installation (was Netbox: postgres cannot be restarted w/ current config) - https://phabricator.wikimedia.org/T184634 (10Volans) p:05High→03Low [15:04:21] Ty volans for next time, anyhow I wonder why this happened. It seems puppet master was overloaded but I was under the impression than puppet agent runs within a cron with random delay to avoid it [15:05:07] fsero: that is if you only enable puppet, it will run on the crontab as usual [15:05:19] if you run run-puppet-agent you're forcing a run right now [15:05:34] (03PS1) 10Elukey: sre.ganeti.makevm: add dns check before creating the vm [cookbooks] - 10https://gerrit.wikimedia.org/r/520897 (https://phabricator.wikimedia.org/T203963) [15:05:45] -1 coming in 3..2..1... [15:05:49] as documented here we should keep batch size low to not overload puppetmasters in those cases [15:05:50] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [15:05:52] https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed [15:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:50] also why running bash from cumin? 
it seems totally unneded if I got what you wanted to do [15:07:05] but I'm just reading backlog, so I might miss context ;) [15:08:00] 10Operations, 10ops-codfw, 10DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10Papaul) Your request is being worked on under reference number 5339905554 Status: Case is generated and in Progress Product description: H... [15:08:36] I was in a hurry and wanted to run multiple commands and seemed the easiest way [15:12:51] 10Operations, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10akosiaris) [15:13:30] 10Operations, 10ops-eqiad: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10akosiaris) 05Open→03Resolved Moving the issue about the disks to T227335, resolving this one [15:15:39] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:47] (03PS1) 10Ladsgroup: statistics: Add wdqs host to wmde statistcs configuration [puppet] - 10https://gerrit.wikimedia.org/r/520901 (https://phabricator.wikimedia.org/T218710) [15:16:48] fsero: so, if you run 'foo && bar' or 'foo; bar' or 'foo || bar', that works as expected in a normal bash. If you want to specify multiple commands you need to use -m/--mode sync|async (one of the two) and specify multiple positional commands 'foo' 'bar' ... [15:17:04] (03CR) 10Volans: [C: 04-1] "Thanks for the patch! 
Some suggestions on how to better use Spicerack inline ;)" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/520897 (https://phabricator.wikimedia.org/T203963) (owner: 10Elukey) [15:17:12] elukey: you asked for it :-P [15:17:18] ahahhahahahahah [15:17:56] basically there is not a single bit that is ok [15:17:59] goooood [15:18:11] basically your works too ;) [15:18:24] Nop volans I think I've explained myself badly storm of alerts appeared after a puppet merge without a forced puppet run [15:19:03] fsero: and was not because of a code error but just overload? [15:19:10] and what was the fix? [15:19:29] No code error without merging anything else just running puppet [15:19:37] Disable puppet for all swift hosts [15:19:43] And then reenable it gradually [15:19:56] You have some logs on #security as well [15:20:21] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [15:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:39] 10Operations, 10ops-codfw: ms-be2018 sdc unreadable sector - https://phabricator.wikimedia.org/T225630 (10Papaul) @fgiunchedi this server is out of warranty since October 2018. We have no 4TB disks on site. [15:21:49] we statistically have 1 puppet run per second (although not that granlarly distributed) [15:22:03] I'm wondering what that puppet run had special to cause this [15:23:44] !log restarting swift-container-sync on swift backends [15:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:49] I doubt it was really an overload, is only applied to a smallish subset of our fleet? if so, we'd see this more often? 
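Two points from this exchange can be compared with a little arithmetic. The first is fsero's question about the cron's random delay. The second is the estimate of roughly one puppet run per second. The sketch below is illustrative only. It assumes a 30-minute agent run interval; under that assumption, one run per second corresponds to a fleet of about 1800 hosts. The hash-based offset is a hypothetical stand-in for whatever randomization the real cron entry uses. The sketch compares the per-second arrival peak of the 70 swift backends in two cases: each host keeping its own slot in the cycle, or all of them being forced with run-puppet-agent at the same moment.

```python
import hashlib
from collections import Counter

INTERVAL_S = 30 * 60  # assumed agent run interval (hypothetical, 30 minutes)


def runs_per_second(fleet_size, interval_s=INTERVAL_S):
    """Average request rate if every host runs once per interval."""
    return fleet_size / interval_s


def splay_offset(host, interval_s=INTERVAL_S):
    """Stable per-host start offset in [0, interval_s).

    Hypothetical stand-in for the cron's randomized schedule: hashing the
    host name spreads hosts across the cycle instead of starting them together.
    """
    digest = hashlib.sha256(host.encode()).digest()
    return int.from_bytes(digest[:8], "big") % interval_s


def peak_starts_per_second(start_times):
    """Largest number of runs starting within the same one-second bucket."""
    return max(Counter(int(t) for t in start_times).values())


hosts = [f"ms-be{n}.codfw.wmnet" for n in range(2016, 2051)] + [
    f"ms-be{n}.eqiad.wmnet" for n in range(1016, 1051)
]

print(runs_per_second(1800))  # 1.0 run/s for a hypothetical 1800-host fleet
print(peak_starts_per_second([splay_offset(h) for h in hosts]))  # spread across the cycle
print(peak_starts_per_second([0] * len(hosts)))  # forced: all 70 start together
```

This crude model only shows that a forced, unbatched run arrives very differently from the normal cron spread. Whether load actually caused the failures is, as the discussion notes, not established.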
[15:24:36] also it failed only for that set of hosts, same cluster [15:24:42] then something triggered and error puppetmaster returned 500 [15:24:44] (03CR) 10Elukey: sre.ganeti.makevm: add dns check before creating the vm (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/520897 (https://phabricator.wikimedia.org/T203963) (owner: 10Elukey) [15:24:45] 10Operations, 10ops-codfw: lvs2002 possible broken BBU - https://phabricator.wikimedia.org/T223949 (10Papaul) 05Open→03Resolved This is a duplicate of T213417 so closing it. [15:24:55] but i cannot offer you more data sorry :/ [15:24:59] so I'm tempted to say it was related to the puppet patch [15:25:09] 10Operations, 10ops-codfw: ms-be2018 sdc unreadable sector - https://phabricator.wikimedia.org/T225630 (10fgiunchedi) >>! In T225630#5308916, @Papaul wrote: > @fgiunchedi this server is out of warranty since October 2018. We have no 4TB disks on site. Ok! I'd like to request ordering of 4TB disk (or multiple... [15:25:24] you can check out the patch was a really simple patch [15:25:26] I can have a quick look at puppetboard and maybe try to compile the catalog with and without the patch [15:25:45] sure [15:25:58] 10Operations, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Done with CPT), and 4 others: Session storage Cassandra cluster configuration - https://phabricator.wikimedia.org/T215883 (10WDoranWMF) [15:27:01] 10Operations, 10ops-codfw: ms-be2018 sdc unreadable sector - https://phabricator.wikimedia.org/T225630 (10Papaul) @fgiunchedi Please open a procurement task in that case. Thanks. 
[15:27:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) (owner: 10Jbond) [15:28:21] (03CR) 10Jbond: [C: 03+2] facter - cpu_details: add governor and scaling_driver facts [puppet] - 10https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) (owner: 10Jbond) [15:28:33] (03PS6) 10Jbond: facter - cpu_details: add governor and scaling_driver facts [puppet] - 10https://gerrit.wikimedia.org/r/520885 (https://phabricator.wikimedia.org/T225713) [15:30:03] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:09] 10Operations, 10procurement: codfw: spare 4TB disks for ms-be hosts - https://phabricator.wikimedia.org/T227337 (10fgiunchedi) [15:32:48] !log uploaded debian buster base docker image [15:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:43] (03PS2) 10Elukey: sre.ganeti.makevm: add dns check before creating the vm [cookbooks] - 10https://gerrit.wikimedia.org/r/520897 (https://phabricator.wikimedia.org/T203963) [15:37:32] (03PS3) 10Elukey: sre.ganeti.makevm: add dns check before creating the vm [cookbooks] - 10https://gerrit.wikimedia.org/r/520897 (https://phabricator.wikimedia.org/T203963) [15:39:21] (03PS1) 10Muehlenhoff: Add DHCP entries for poolcounter100[45], poolcounter200[34] [puppet] - 10https://gerrit.wikimedia.org/r/520906 (https://phabricator.wikimedia.org/T226811) [15:39:56] (03PS9) 10Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 [15:40:24] (03PS1) 10Paladox: Gerrit: Wrap