[00:00:04] Deploy window NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190705T0000)
[04:52:55] (PS2) Marostegui: Phabricator: Set taskmasters to 4 [puppet] - https://gerrit.wikimedia.org/r/520770 (https://phabricator.wikimedia.org/T227251) (owner: 20after4)
[04:53:25] (CR) Marostegui: [C: +2] Phabricator: Set taskmasters to 4 [puppet] - https://gerrit.wikimedia.org/r/520770 (https://phabricator.wikimedia.org/T227251) (owner: 20after4)
[05:01:27] (PS1) Marostegui: mariadb: Decommission db1069 [puppet] - https://gerrit.wikimedia.org/r/520828 (https://phabricator.wikimedia.org/T227166)
[05:02:27] !log Remove db1069 from tendril and zarcillo - T227166
[05:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:33] T227166: decommission db1069 - https://phabricator.wikimedia.org/T227166
[05:04:05] (CR) Marostegui: [C: +2] mariadb: Decommission db1069 [puppet] - https://gerrit.wikimedia.org/r/520828 (https://phabricator.wikimedia.org/T227166) (owner: Marostegui)
[05:08:19] !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.makevm
[05:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:58] !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.makevm
[05:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:42] !log Stop MySQL on db1069 for decommission T227166
[05:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:47] T227166: decommission db1069 - https://phabricator.wikimedia.org/T227166
[05:11:10] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[05:17:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[05:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:22] (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520830 (https://phabricator.wikimedia.org/T227062)
[05:18:39] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[05:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:39] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520830 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:21:43] (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520830 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:21:51] Operations, DBA, OTRS, Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (Marostegui) >>! In T226952#5295368, @Marostegui wrote: > Note: db2044 needs upgrading This was done
[05:22:00] (CR) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520830 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:22:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1104 for upgrade (duration: 00m 51s)
[05:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:23] !log Upgrade db1104 T227062
[05:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:28] T227062: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062
[05:35:58] Operations, Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (Marostegui) >>! In T227251#5306948, @mmodell wrote: > Now the graphs look better. Unfortunately, puppet will set the config back to 10 taskmasters unless we m...
[05:38:25] (PS1) Marostegui: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520833
[05:41:04] Operations, DBA, MediaWiki-Configuration, Patch-For-Review: Data model for dbconfig - https://phabricator.wikimedia.org/T197531 (Marostegui) @Joe @CDanis is this task still valid?
[05:41:24] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520833 (owner: Marostegui)
[05:42:17] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520833 (owner: Marostegui)
[05:42:39] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520833 (owner: Marostegui)
[05:42:49] (PS1) Vgutierrez: install_server: Add DHCP entries for ncredir[12]001 [puppet] - https://gerrit.wikimedia.org/r/520836 (https://phabricator.wikimedia.org/T133548)
[05:43:19] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1104 after upgrade (duration: 00m 49s)
[05:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:59] (CR) Vgutierrez: [C: +2] install_server: Add DHCP entries for ncredir[12]001 [puppet] - https://gerrit.wikimedia.org/r/520836 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[05:46:53] (PS1) Marostegui: db-eqiad.php: Fully repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520838
[05:54:10] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520838 (owner: Marostegui)
[05:55:00] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520838 (owner: Marostegui)
[05:56:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1104 after upgrade (duration: 00m 49s)
[05:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:24] (CR) jenkins-bot: db-eqiad.php: Fully repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/520838 (owner: Marostegui)
[05:57:28] (CR) Vgutierrez: [C: +2] ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[05:57:39] (CR) Vgutierrez: [C: +2] acme_chief: Introduce the concept of shared certificates [puppet] - https://gerrit.wikimedia.org/r/517660 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez)
[05:57:49] (PS7) Vgutierrez: acme_chief: Introduce the concept of shared certificates [puppet] - https://gerrit.wikimedia.org/r/517660 (https://phabricator.wikimedia.org/T133548)
[06:01:33] (PS25) Vgutierrez: ncredir: Provide initial puppetization [puppet] - https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548)
[06:09:44] (PS1) Vgutierrez: hieradata: Grant ncredir instances access to the ncredir certificates [puppet] - https://gerrit.wikimedia.org/r/520840 (https://phabricator.wikimedia.org/T133548)
[06:14:01] (PS1) Vgutierrez: site: Add ncredir[12]001 instances definition [puppet] - https://gerrit.wikimedia.org/r/520841 (https://phabricator.wikimedia.org/T133548)
[06:17:21] (CR) Jcrespo: "> Good point, I think that might have applied only to Prometheus 1. IMHO worth trying not force creation of empty files while we're at it " [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: Jcrespo)
[06:18:28] (CR) Vgutierrez: [C: +2] redirects.dat: Provide support for nginx in compile_redirects() [puppet] - https://gerrit.wikimedia.org/r/513279 (https://phabricator.wikimedia.org/T224539) (owner: Vgutierrez)
[06:18:38] (PS4) Vgutierrez: redirects.dat: Provide support for nginx in compile_redirects() [puppet] - https://gerrit.wikimedia.org/r/513279 (https://phabricator.wikimedia.org/T224539)
[06:19:14] (PS5) Jcrespo: mariadb: Prepare core for buster [puppet] - https://gerrit.wikimedia.org/r/519073 (https://phabricator.wikimedia.org/T193224)
[06:19:16] (PS14) Jcrespo: prometheus-mysqld-exporter: Automate targets based on zarcillo db [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896)
[06:22:01] Operations, Traffic: Provide nginx support in compile_redirects() - https://phabricator.wikimedia.org/T224539 (Vgutierrez) Open→Resolved
[06:22:08] Operations, Traffic, Goal, HTTPS, Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (Vgutierrez)
[06:24:45] (PS1) Marostegui: db1109: Convert it to candidate master [puppet] - https://gerrit.wikimedia.org/r/520842 (https://phabricator.wikimedia.org/T227062)
[06:25:21] (CR) Marostegui: [C: +2] db1109: Convert it to candidate master [puppet] - https://gerrit.wikimedia.org/r/520842 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[06:27:57] Operations, Analytics, hardware-requests, User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (elukey) >>! In T227288#5307228, @MoritzMuehlenhoff wrote: > Should these really be both in eqiad? The initial use case is for analytics, but we migh...
[06:30:31] (CR) Elukey: [C: +1] Update a number of comments still referring to Ubuntu [puppet] - https://gerrit.wikimedia.org/r/520764 (owner: Muehlenhoff)
[06:32:40] (PS1) Marostegui: db-codfw.php: Clean up old comments [mediawiki-config] - https://gerrit.wikimedia.org/r/520843
[06:32:44] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:33:51] (CR) Marostegui: [C: +2] db-codfw.php: Clean up old comments [mediawiki-config] - https://gerrit.wikimedia.org/r/520843 (owner: Marostegui)
[06:34:41] (Merged) jenkins-bot: db-codfw.php: Clean up old comments [mediawiki-config] - https://gerrit.wikimedia.org/r/520843 (owner: Marostegui)
[06:35:49] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove old comments (duration: 00m 50s)
[06:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:12] (PS3) Jcrespo: Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/520737
[06:36:32] (CR) jenkins-bot: db-codfw.php: Clean up old comments [mediawiki-config] - https://gerrit.wikimedia.org/r/520843 (owner: Marostegui)
[06:36:35] Operations, DBA, MediaWiki-Configuration, Patch-For-Review: Data model for dbconfig - https://phabricator.wikimedia.org/T197531 (Volans) Open→Resolved a:Volans The data model is now part of the software and will evolve with it, wikitech documentation will be provided for it. I'm resol...
[06:36:41] Operations, DBA, MediaWiki-Configuration, Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (Volans)
[06:38:18] (CR) Volans: [C: +2] Release 1.1.0 [software/conftool] - https://gerrit.wikimedia.org/r/519752 (owner: Volans)
[06:40:55] (Merged) jenkins-bot: Release 1.1.0 [software/conftool] - https://gerrit.wikimedia.org/r/519752 (owner: Volans)
[06:40:57] (PS4) Jcrespo: Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/520737
[06:42:27] (CR) Jcrespo: [C: +1] Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/520737 (owner: Jcrespo)
[06:43:28] (CR) Jcrespo: [C: +2] Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/520737 (owner: Jcrespo)
[06:43:50] (CR) Volans: [C: +2] debian: Release 1.1.0 [software/conftool] - https://gerrit.wikimedia.org/r/519753 (owner: Volans)
[06:44:25] (Merged) jenkins-bot: Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/520737 (owner: Jcrespo)
[06:46:08] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1109 with full weight (duration: 00m 49s)
[06:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:17] Operations, DBA, MediaWiki-Configuration, Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (Volans) All patches for v1 of dbconfig are merged, including the ones to make a new conftool releas...
[06:46:25] (CR) jenkins-bot: Revert "mariadb: Depool db1109 for upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/520737 (owner: Jcrespo)
[06:46:27] (Merged) jenkins-bot: debian: Release 1.1.0 [software/conftool] - https://gerrit.wikimedia.org/r/519753 (owner: Volans)
[06:48:39] (PS1) Jcrespo: mariadb: Depool db1087 (s8 sanitarium master) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/520844
[06:52:06] Operations, observability, Performance-Team (Radar), User-Elukey: Consider adding per-shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (elukey) The PR is still waiting for the second upstream review, since there is no real rush I'd prefer to wait for t...
[06:58:52] Operations, Goal, User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197 (elukey)
[06:59:19] Operations, Analytics, ChangeProp, EventBus, and 4 others: Create custom per-job metric reporters capability - https://phabricator.wikimedia.org/T182274 (elukey)
[06:59:54] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:01:10] Operations, Analytics, EventBus, User-Elukey: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048 (elukey) Open→Declined Eventbus is on its road to decommission in favor of event-gate, I'd close this task since probably not relevant any...
[07:02:28] (CR) Marostegui: [C: +1] mariadb: Depool db1087 (s8 sanitarium master) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/520844 (owner: Jcrespo)
[07:05:20] (PS2) Muehlenhoff: prometheus-snmp-exporter: Switch to systemd::service [puppet] - https://gerrit.wikimedia.org/r/520773 (https://phabricator.wikimedia.org/T194724)
[07:08:53] (CR) Muehlenhoff: [C: +2] prometheus-snmp-exporter: Switch to systemd::service [puppet] - https://gerrit.wikimedia.org/r/520773 (https://phabricator.wikimedia.org/T194724) (owner: Muehlenhoff)
[07:10:46] (CR) Jcrespo: [C: +2] mariadb: Depool db1087 (s8 sanitarium master) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/520844 (owner: Jcrespo)
[07:11:01] (Merged) jenkins-bot: mariadb: Depool db1087 (s8 sanitarium master) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/520844 (owner: Jcrespo)
[07:11:03] (CR) jenkins-bot: mariadb: Depool db1087 (s8 sanitarium master) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/520844 (owner: Jcrespo)
[07:13:18] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1087 (duration: 00m 52s)
[07:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:23] (PS1) Muehlenhoff: Revert "prometheus-snmp-exporter: Switch to systemd::service" [puppet] - https://gerrit.wikimedia.org/r/520845
[07:16:13] (CR) jerkins-bot: [V: -1] Revert "prometheus-snmp-exporter: Switch to systemd::service" [puppet] - https://gerrit.wikimedia.org/r/520845 (owner: Muehlenhoff)
[07:16:16] PROBLEM - puppet last run on netmon1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[07:17:24] (CR) Muehlenhoff: [V: +2 C: +2] Revert "prometheus-snmp-exporter: Switch to systemd::service" [puppet] - https://gerrit.wikimedia.org/r/520845 (owner: Muehlenhoff)
[07:17:52] !log Compress small wikis on labsdb1009 T222978
[07:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:57] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978
[07:19:56] Operations, Analytics, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey) Tried to check in /var/log/apt/history the packages installed to make the Tensorflow and Thumbor (uses OpenCL) use case working: ` cxlactivitylogger hcc hsa-rocr-dev...
[07:21:42] RECOVERY - puppet last run on netmon1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:23:40] !log installing wireshark security updates on jessie
[07:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:48] (PS1) Jcrespo: Revert "mariadb: Depool db1087 (s8 sanitarium master) for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/520847
[07:33:38] Operations, Analytics, User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (elukey) Also there seems to be some movement in Debian for rocm: https://lists.debian.org/debian-devel/2019/06/msg00302.html
[07:35:25] !log installing imagemagick security updates on jessie
[07:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:12] (PS2) Jcrespo: Revert "mariadb: Depool db1087 (s8 sanitarium master) for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/520847
[07:49:50] (CR) Jcrespo: [C: +2] Revert "mariadb: Depool db1087 (s8 sanitarium master) for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/520847 (owner: Jcrespo)
[07:51:03] (Merged) jenkins-bot: Revert "mariadb: Depool db1087 (s8 sanitarium master) for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/520847 (owner: Jcrespo)
[07:51:05] (CR) jenkins-bot: Revert "mariadb: Depool db1087 (s8 sanitarium master) for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/520847 (owner: Jcrespo)
[07:57:07] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1087 (duration: 00m 48s)
[07:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:14] (PS1) Elukey: aptrepo: add component/amd-rocm [puppet] - https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723)
[08:33:55] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (fgiunchedi) After ~20h `ms-be2037` of running with "os control" set and `powersave` governor seems to behave fine. Compared to `performance` cpu load is slightly higher as expected and temperature slightl...
[08:42:16] (PS1) Jcrespo: Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850
[08:42:49] (CR) jerkins-bot: [V: -1] Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[08:46:32] (PS2) Jcrespo: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768
[08:46:34] (PS2) Jcrespo: Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850
[08:46:55] (CR) jerkins-bot: [V: -1] replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768 (owner: Jcrespo)
[08:46:59] (CR) jerkins-bot: [V: -1] Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[08:47:29] (PS3) Jcrespo: Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850
[08:49:38] (PS3) Jcrespo: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768
[08:50:01] (CR) jerkins-bot: [V: -1] replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768 (owner: Jcrespo)
[08:51:59] (PS1) Muehlenhoff: Add library hint for postgres [puppet] - https://gerrit.wikimedia.org/r/520852
[08:53:21] (CR) Muehlenhoff: [C: +2] Add library hint for postgres [puppet] - https://gerrit.wikimedia.org/r/520852 (owner: Muehlenhoff)
[08:53:45] a heads up, the VM running irc.wikimedia.org will be rebooted in about ten minutes for a security update (all clients have been automatically reconnecting in the past)
[08:54:05] (CR) Marostegui: [C: +1] "<3 thanks!" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[08:54:53] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime
[08:54:54] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:51] ACKNOWLEDGEMENT - Host elastic2054 is DOWN: PING CRITICAL - Packet loss = 100% Gehel tracked on https://phabricator.wikimedia.org/T227298
[09:00:47] (PS2) Gehel: cloudelastic: use the proper check for SSL certificates [puppet] - https://gerrit.wikimedia.org/r/520782
[09:01:13] !log rebooting kraz (irc.wikimedia.org) to pick up MDS-enabled qemu
[09:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:36] (CR) Gehel: [C: +2] cloudelastic: use the proper check for SSL certificates [puppet] - https://gerrit.wikimedia.org/r/520782 (owner: Gehel)
[09:06:51] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (fgiunchedi) Continuing the fleetwide audit, my impression is that unless explicitly set by puppet the governor should be `powersave`, thus the hosts that currently don't have that are: == Dell == `cumin...
[09:12:03] (CR) Jcrespo: "./switchover.py es2002 es2001 --read-only-master" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[09:12:36] (CR) Jcrespo: "Feel also free to criticize the wording for each one, as I have run of creativity." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[09:14:06] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (hashar) I am not sure whether it is related, but a month or so ago I have noticed that the old cloudvirt machines to have poor CPU performance for an unknown reason yet. We have made a benchmark on labte...
[09:15:03] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=codfw,name=elastic2054.codfw.wmnet
[09:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:24] (CR) Marostegui: [C: +1] "> Feel also free to criticize the wording for each one, as I have run" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[09:25:02] Operations: Support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T202255 (MoritzMuehlenhoff) Open→Declined We swapped the NICs in these servers to a model supported by 4.9 (Broadcom BCM57412) and for any new deployments we can use Buster which has a 4.19 k...
[09:26:16] Operations, Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (Gehel) elastic2054 is down again. It is set to pooled=inactive, and marked as failed in netbox. @Papaul: it looks like this is going to need your help. You can do whatever you need with this server and reboot i...
[09:28:31] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:28:32] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:05] !log rebooting LDAP replicas in eqiad
[09:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:26] (PS1) Ema: cache: reimage cp1086 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520860 (https://phabricator.wikimedia.org/T226638)
[09:39:57] !log depool cp1086 and reimage as upload_ats T226638
[09:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:01] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638
[09:41:14] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (fgiunchedi) In terms of scaling drivers, here's the list of hosts that don't have `intel_pstate` (which AIUI is what we want to use) `cumin -p99 -b100 'F:virtual ~ physical' 'cat /sys/devices/system/cpu/...
[09:41:30] (CR) Ema: [C: +1] "Seems fine and pcc agrees https://puppet-compiler.wmflabs.org/compiler1001/17238/" [puppet] - https://gerrit.wikimedia.org/r/520774 (owner: Muehlenhoff)
[09:42:00] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (fgiunchedi) >>! In T225713#5307833, @hashar wrote: > I am not sure whether it is related, but a month or so ago I have noticed that the old cloudvirt machines to have poor CPU performance for an unknown r...
[09:42:38] (CR) Ema: [C: +2] cache: reimage cp1086 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/520860 (https://phabricator.wikimedia.org/T226638) (owner: Ema)
[09:45:28] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1086.eqiad.wmnet'] ` The log can be found in `...
[09:47:11] Operations, MediaWiki-Cache, Performance-Team (Radar), User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (elukey) The two remaining use cases are: * labswiki * thumbor The latter should be doable, but the former seems a bit more complicated...
[09:48:41] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:48:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:15] !log rebooting serpens to pick up MDS-enabled qemu
[09:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:42] Operations, observability, serviceops, Performance-Team (Radar), User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (elukey) @fgiunchedi I noticed that node_network_transmit_bytes_total is already used for swift in puppet, do you have any sugg...
[09:52:44] 10Operations, 10Wikimedia-Mailing-lists, 10Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Qgil) a:03Qgil [10:00:29] !log rebooting seaborgium to pick up MDS-enabled qemu [10:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:52] !log Rolling rebood rdb* hosts - T227304 [10:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:57] T227304: Reboot rdb* cluster - https://phabricator.wikimedia.org/T227304 [10:06:23] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [10:06:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:01] (03CR) 10Vgutierrez: [C: 03+2] hieradata: Grant ncredir instances access to the ncredir certificates [puppet] - 10https://gerrit.wikimedia.org/r/520840 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [10:09:10] (03PS2) 10Vgutierrez: hieradata: Grant ncredir instances access to the ncredir certificates [puppet] - 10https://gerrit.wikimedia.org/r/520840 (https://phabricator.wikimedia.org/T133548) [10:14:08] (03CR) 10Jbond: [V: 03+2 C: 03+2] "plus 2" [labs/private] - 10https://gerrit.wikimedia.org/r/520776 (owner: 10Jbond) [10:14:37] !log fixed up kernel packages on serpens/seaborgium, these were dist-upgraded from jessie, but the correct kernel packages for Stretch were not setup, as such there were still stuck with an old jessie kernel [10:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:34] !log rebooting serpens to pick up correct Stretch kernel [10:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:25] (03PS2) 10Vgutierrez: site: Add ncredir[12]001 instances definition [puppet] - 
10https://gerrit.wikimedia.org/r/520841 (https://phabricator.wikimedia.org/T133548) [10:18:26] (03PS1) 10Vgutierrez: install_server: Add disk layout for ncredir[12]001 instances [puppet] - 10https://gerrit.wikimedia.org/r/520865 (https://phabricator.wikimedia.org/T133548) [10:19:37] (03CR) 10Vgutierrez: [C: 03+2] install_server: Add disk layout for ncredir[12]001 instances [puppet] - 10https://gerrit.wikimedia.org/r/520865 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [10:23:10] (03CR) 10Vgutierrez: [C: 03+2] site: Add ncredir[12]001 instances definition [puppet] - 10https://gerrit.wikimedia.org/r/520841 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [10:23:12] !log rebooting seaborgium to pick up correct Stretch kernel [10:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:39] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [10:23:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:22] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:29:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:45] !log rebooting debug proxies to pick up MDS-enabled qemu [10:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:02] PROBLEM - docker-registry LVS codfw on docker-registry.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 354 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Docker-registry-runbook [10:31:13] PROBLEM - LVS HTTP IPv4 on docker-registry.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 503 Service Unavailable - 354 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:31:15] ^ expected [10:31:18] PROBLEM - Docker registry health on registry2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 235 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [10:31:31] ^ expected [10:31:34] PROBLEM - Docker registry HTTPS interface on registry2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2002.codfw.wmnet:443/v2/wikimedia-stretch/manifests/latest - 354 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Docker [10:31:44] PROBLEM - Docker registry HTTPS interface on registry2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2001.codfw.wmnet:443/v2/wikimedia-stretch/manifests/latest - 354 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Docker [10:31:50] PROBLEM - Docker registry health on registry2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 235 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [10:32:00] kk, thanks jijiki [10:32:21] k (that was a page) [10:32:30] RECOVERY - docker-registry LVS codfw on docker-registry.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 292 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Docker-registry-runbook [10:32:41] RECOVERY - LVS HTTP IPv4 on docker-registry.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 292 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:32:46] RECOVERY - Docker registry health on registry2002 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [10:33:02] RECOVERY - Docker registry HTTPS 
interface on registry2002 is OK: HTTP OK: HTTP/1.1 200 OK - 2545 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Docker [10:33:10] RECOVERY - Docker registry HTTPS interface on registry2001 is OK: HTTP OK: HTTP/1.1 200 OK - 2545 bytes in 0.260 second response time https://wikitech.wikimedia.org/wiki/Docker [10:33:18] RECOVERY - Docker registry health on registry2001 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [10:33:28] indeed, the rdb hosts reboot triggered registry failure? [10:34:03] yes [10:34:07] from logs [10:34:11] https://www.irccloud.com/pastebin/0PLYj8sc/ [10:34:13] 10Operations, 10Developer-Advocacy, 10Discourse, 10Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (10Qgil) [10:34:21] that is unexpected, I'll file a task [10:36:39] ok then it was expected from my POV :p [10:36:55] ack, thanks [10:52:40] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1086.eqiad.wmnet'] ` and were **ALL** successful. 
[10:55:15] !log pool cp1086 w/ ATS backend T226638 [10:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:21] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [10:59:10] (03PS1) 10Ema: cache: reimage cp1088 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520867 (https://phabricator.wikimedia.org/T226638) [11:00:11] !log depool cp1088 and reimage as upload_ats T226638 [11:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:19] (03CR) 10Ema: [C: 03+2] cache: reimage cp1088 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520867 (https://phabricator.wikimedia.org/T226638) (owner: 10Ema) [11:02:47] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10fgiunchedi) >>! In T224454#5307968, @elukey wrote: > @fgiunchedi I noticed that node_network_transmit_bytes_total is already u... [11:04:11] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1088.eqiad.wmnet'] ` The log can be found in `... 
[11:04:57] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [11:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:09] !log jmm@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [11:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:47] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [11:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:56] !log jmm@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [11:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:51] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [11:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:01] !log jmm@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [11:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:01] 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10fgiunchedi) [11:19:13] 10Operations: creation of prometheus_puppet_agent_stats fails on first puppet run - https://phabricator.wikimedia.org/T227315 (10Vgutierrez) [11:21:56] (03PS1) 10Vgutierrez: prometheus: Fix prometheus_puppet_agent_stats dependencies [puppet] - 10https://gerrit.wikimedia.org/r/520869 (https://phabricator.wikimedia.org/T227315) [11:26:32] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Fix prometheus_puppet_agent_stats dependencies [puppet] - 10https://gerrit.wikimedia.org/r/520869 (https://phabricator.wikimedia.org/T227315) (owner: 10Vgutierrez) [11:28:23] (03CR) 10Vgutierrez: [C: 03+2] prometheus: Fix prometheus_puppet_agent_stats dependencies [puppet] - 10https://gerrit.wikimedia.org/r/520869 (https://phabricator.wikimedia.org/T227315) (owner: 10Vgutierrez) [11:28:36] (03PS2) 10Vgutierrez: prometheus: Fix prometheus_puppet_agent_stats 
dependencies [puppet] - 10https://gerrit.wikimedia.org/r/520869 (https://phabricator.wikimedia.org/T227315) [11:31:02] !log installing postgresql-9.4 updates on jessie [11:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:59] 10Operations: creation of prometheus_puppet_agent_stats fails on first puppet run - https://phabricator.wikimedia.org/T227315 (10Vgutierrez) 05Open→03Resolved p:05Triage→03Normal a:03Vgutierrez [11:32:08] that was fast.. [11:32:17] !log Upgrading smartarray firmware on ms-be1021 - T141756 - T227076 [11:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:24] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [11:32:24] T227076: Upgrade firmware on ms-be1021 (Was: Degraded RAID on ms-be1021) - https://phabricator.wikimedia.org/T227076 [11:33:13] vgutierrez: \o/ [11:33:16] thank you [11:33:19] np :D [11:35:03] (03PS1) 10Ema: cache: reimage cp1090 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520870 (https://phabricator.wikimedia.org/T226638) [11:38:26] !log Reboot ms-be1021 - T141756 - T227076 [11:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:32] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [11:38:32] T227076: Upgrade firmware on ms-be1021 (Was: Degraded RAID on ms-be1021) - https://phabricator.wikimedia.org/T227076 [11:39:40] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [11:39:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:14] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 
(10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1088.eqiad.wmnet'] ` and were **ALL** successful. [11:46:56] !log pool cp1088 w/ ATS backend T226638 [11:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:01] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [11:47:26] 10Operations, 10ops-eqiad, 10serviceops: Upgrade firmware on ms-be1021 (Was: Degraded RAID on ms-be1021) - https://phabricator.wikimedia.org/T227076 (10jijiki) 05Open→03Resolved a:03jijiki There are still messages like ` [ 122.753602] perf: interrupt took too long (2953 > 2500), lowering kernel.per... [11:56:31] (03PS1) 10Ema: cache_upload: remove varnish from frontend::backend_services [puppet] - 10https://gerrit.wikimedia.org/r/520872 (https://phabricator.wikimedia.org/T226589) [12:02:28] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10MoritzMuehlenhoff) >>! In T227288#5307686, @elukey wrote: > This is a very good point. Would we have only one KDC per datacenter? I think having o... [12:02:40] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10MoritzMuehlenhoff) p:05Triage→03Normal [12:02:48] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (10ema) @Gilles: is there anything left to be done here? Other than blogging about the results that is. 
:-) [12:03:56] 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10MoritzMuehlenhoff) p:05Triage→03Normal [12:05:18] 10Operations, 10Analytics, 10Traffic: Increased number of webrequest sequence-numbers alarms (mostly) on upload webrequest-source - https://phabricator.wikimedia.org/T225786 (10ema) [12:06:05] (03PS4) 10Jcrespo: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520768 [12:06:18] (03CR) 10jerkins-bot: [V: 04-1] replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520768 (owner: 10Jcrespo) [12:09:11] (03PS1) 10Ema: ATS: do not add Server: header [puppet] - 10https://gerrit.wikimedia.org/r/520875 (https://phabricator.wikimedia.org/T224119) [12:09:16] (03PS5) 10Jcrespo: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520768 [12:12:03] !log depool cp1090 and reimage as upload_ats T226638 [12:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:08] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [12:13:08] (03CR) 10Ema: [C: 03+2] cache: reimage cp1090 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520870 (https://phabricator.wikimedia.org/T226638) (owner: 10Ema) [12:14:38] (03Abandoned) 10Hashar: cassandra: fix spec service provider [puppet] - 10https://gerrit.wikimedia.org/r/503996 (owner: 10Hashar) [12:15:17] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1090.eqiad.wmnet'] ` The log can be found in `... 
[12:18:28] (03PS2) 10Jbond: Remove support for Ubuntu from os_version and related tests [puppet] - 10https://gerrit.wikimedia.org/r/520765 (owner: 10Muehlenhoff) [12:23:31] 10Operations, 10serviceops: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (10hashar) [12:24:16] 10Operations, 10serviceops: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (10hashar) Seems `krypton.eqiad.wmnet` is still using Jessie / php5.6. We could use an upgrade to Stretch to drop php5.6 support from the CI infrastructure :-] [12:24:46] (03CR) 10Muehlenhoff: [C: 04-1] aptrepo: add component/amd-rocm (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [12:31:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks fine, besides the fleet-wide default ports like SSH, the only thing externally reachable in need of a ferm rule is rsyncd, but stati" [puppet] - 10https://gerrit.wikimedia.org/r/520706 (https://phabricator.wikimedia.org/T170826) (owner: 10Elukey) [12:31:46] 10Operations, 10Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (10MoritzMuehlenhoff) p:05High→03Normal [12:39:26] 10Operations: HP Gen9 onboard controller review - https://phabricator.wikimedia.org/T216175 (10MoritzMuehlenhoff) I saw this task during clinic duty and I'm wondering what/if there's anything left to be done? S100i SR SW RAID seems to be about some HP software offering for Windows to run a software RAID, but we... [12:41:55] 10Operations, 10netbox: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634 (10MoritzMuehlenhoff) I rebooted the Netbox hosts (1002, 2001) for the MDS kernel issues this week and that does not seem to be an issue any more. Can this bug be closed or is there anyth... 
[12:43:39] 10Operations, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10MoritzMuehlenhoff) @Vgutierrez The firmware update on the NICs fixed this for good, right? Can we close this task? [12:46:48] 10Operations, 10Patch-For-Review: logrotate for visualdiff tests on Parsoid test host (scandium) - https://phabricator.wikimedia.org/T161920 (10MoritzMuehlenhoff) [12:47:00] 10Operations, 10vm-requests: Site: eqiad/codfw 2 VMs each for pool counters - https://phabricator.wikimedia.org/T226811 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [12:50:29] 10Operations, 10Phabricator: Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts - https://phabricator.wikimedia.org/T227251 (10Marostegui) 05Open→03Resolved a:03mmodell Just to clarify, we have lowered the priority because the slaves are no longer lagging. A few minutes ago the... [12:52:21] (03CR) 10Elukey: aptrepo: add component/amd-rocm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [12:53:11] 10Operations, 10Wikimedia-Mailing-lists: LGBT mailing list moderator password reset - https://phabricator.wikimedia.org/T225787 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I'm marking this as resolved, please reopen if anything else needs to be done. [12:53:32] (03PS2) 10Elukey: aptrepo: add component/amd-rocm [puppet] - 10https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) [12:56:29] 10Operations, 10ops-codfw: ms-be2018 sdc unreadable sector - https://phabricator.wikimedia.org/T225630 (10fgiunchedi) a:03Papaul @Papaul please order / replace this disk when you get a chance! 
[12:58:30] (03CR) 10Muehlenhoff: [C: 03+1] aptrepo: add component/amd-rocm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [12:59:14] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10elukey) Makes sense, the extra latency to codfw shouldn't be a big deal. I know that we need to have only one kadmin server, but I was thinking abou... [13:01:16] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1090.eqiad.wmnet'] ` and were **ALL** successful. [13:03:06] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10elukey) >>! In T224454#5308149, @fgiunchedi wrote: >>>! In T224454#5307968, @elukey wrote: >> @fgiunchedi I noticed that node_... [13:04:16] (03CR) 10Elukey: aptrepo: add component/amd-rocm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [13:05:01] (03PS3) 10Elukey: aptrepo: add component/amd-rocm [puppet] - 10https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) [13:05:25] !log pool cp1090 w/ ATS backend T226638 [13:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:31] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [13:06:09] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ema) 05Open→03Resolved a:03ema With the conversion of cp1090 this is now done. 
[13:06:12] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes - https://phabricator.wikimedia.org/T226589 (10ema) [13:06:36] (03Abandoned) 10Ema: Revert "Normalize thumbnail URLs to avoid cachebusting" [puppet] - 10https://gerrit.wikimedia.org/r/518231 (owner: 10Ema) [13:06:55] (03Abandoned) 10Ema: package_builder: move lintian out of require_package [puppet] - 10https://gerrit.wikimedia.org/r/506679 (owner: 10Ema) [13:07:12] (03Abandoned) 10Ema: ATS: log cache results and backend URL [puppet] - 10https://gerrit.wikimedia.org/r/477245 (owner: 10Ema) [13:10:13] (03PS7) 10Fsero: adding a buster docker base image [puppet] - 10https://gerrit.wikimedia.org/r/520503 [13:11:45] (03CR) 10Fsero: [C: 03+2] adding a buster docker base image [puppet] - 10https://gerrit.wikimedia.org/r/520503 (owner: 10Fsero) [13:20:30] 10Operations, 10ops-codfw: lvs2002 possible broken BBU - https://phabricator.wikimedia.org/T223949 (10MoritzMuehlenhoff) a:03Papaul [13:22:49] 10Operations, 10ops-eqiad, 10Traffic: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (10ema) 05Open→03Resolved a:03ema The host has been in production for weeks without issues now. Closing. 
[13:23:27] 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:23:45] 10Operations, 10cloud-services-team: Investigate use of hp-asrd on HPE servers - https://phabricator.wikimedia.org/T221939 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:25:30] (03PS3) 10Fsero: registry: improving swift replication [puppet] - 10https://gerrit.wikimedia.org/r/519018 [13:25:39] (03CR) 10Fsero: [C: 03+2] registry: improving swift replication [puppet] - 10https://gerrit.wikimedia.org/r/519018 (owner: 10Fsero) [13:26:26] (03PS2) 10Ema: cache_upload: remove varnish from frontend::backend_services [puppet] - 10https://gerrit.wikimedia.org/r/520872 (https://phabricator.wikimedia.org/T226589) [13:26:57] !log restarting swift-container-sync on swift backends [13:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:13] (03CR) 10Ema: [C: 03+2] cache_upload: remove varnish from frontend::backend_services [puppet] - 10https://gerrit.wikimedia.org/r/520872 (https://phabricator.wikimedia.org/T226589) (owner: 10Ema) [13:28:03] ema safe to merge? [13:28:22] if no you can merge mine at your convenience [13:28:22] fsero: yes, please go ahead [13:28:35] done ty [13:29:57] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:29:58] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:29:58] PROBLEM - puppet last run on ms-be1038 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:01] PROBLEM - puppet last run on ms-be1046 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
Could be an interrupted request or a dependency cycle. [13:30:07] PROBLEM - puppet last run on ms-be2050 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:13] PROBLEM - puppet last run on ms-be1048 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:15] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:22] er/ [13:30:23] PROBLEM - puppet last run on ms-be1045 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:24] err [13:30:35] PROBLEM - puppet last run on ms-be1044 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:42] what are those ms-be.* servers for? [13:30:43] PROBLEM - puppet last run on ms-be2044 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:45] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:45] PROBLEM - puppet last run on ms-be2047 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:45] PROBLEM - puppet last run on ms-be1029 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:30:47] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
Could be an interrupted request or a dependency cycle. [13:30:54] godog: ^ [13:30:55] hauskatze: swift back end [13:30:57] PROBLEM - puppet last run on ms-be1041 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:31:01] PROBLEM - puppet last run on ms-be2033 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:31:06] apergos: ack, thanks for explaining :) [13:31:13] (03PS1) 10Milimetric: Update Mediawiki Reduced snapshot for AQS [puppet] - 10https://gerrit.wikimedia.org/r/520884 [13:31:23] PROBLEM - puppet last run on ms-be1050 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:31:26] looking into that [13:31:31] PROBLEM - puppet last run on ms-be1036 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:31:36] fsero: any chance this is you? [13:31:38] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:31:40] (03CR) 10Muehlenhoff: [C: 04-1] aptrepo: add component/amd-rocm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [13:31:47] PROBLEM - puppet last run on ms-be2040 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:31:48] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[13:31:49] PROBLEM - puppet last run on ms-be2030 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:15] PROBLEM - puppet last run on ms-be2029 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:21] PROBLEM - puppet last run on ms-be2038 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:23] PROBLEM - puppet last run on ms-be1024 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:29] PROBLEM - puppet last run on ms-be2037 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:41] PROBLEM - puppet last run on ms-be2046 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:47] PROBLEM - puppet last run on ms-be2031 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:47] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:32:57] PROBLEM - puppet last run on ms-be2036 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:15] PROBLEM - puppet last run on ms-be2048 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[13:33:17] PROBLEM - puppet last run on ms-be1031 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:18] apergos: maybe [13:33:21] looking into it [13:33:23] PROBLEM - puppet last run on ms-be1042 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:27] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:28] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:28] PROBLEM - puppet last run on ms-be2026 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:31] ty [13:33:41] PROBLEM - puppet last run on ms-be1040 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:51] PROBLEM - puppet last run on ms-be1049 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [13:33:55] !log disabling puppet on swift backends [13:33:57]