[00:20:31] (PS1) Jforrester: Retry "CommonSettings: Factor out variant config generation into MWConfigCacheGenerator" [mediawiki-config] - https://gerrit.wikimedia.org/r/525698
[00:21:22] (CR) jerkins-bot: [V: -1] Retry "CommonSettings: Factor out variant config generation into MWConfigCacheGenerator" [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (owner: Jforrester)
[01:55:23] Operations, ops-codfw, serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (Eevans) >>! In T227408#5364314, @jijiki wrote: > @Eevans I was under the impression we have more work to be done on the server. Shall we mark this task as resolved? I was under that impression t...
[02:00:09] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:01:39] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:24:19] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:26:05] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[02:46:45] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 31324192 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:49:59] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 76144 and 86 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:08:44] (CR) Eevans: table-properties: Initial commit (4 comments) [software/cassandra-table-properties] - https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: Holger Knust)
[03:23:23] PROBLEM - puppet last run on wtp1048 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:30:23] PROBLEM - puppet last run on an-worker1080 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:35:11] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:51:31] RECOVERY - puppet last run on wtp1048 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:58:31] RECOVERY - puppet last run on an-worker1080 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:03:17] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:12:19] (CR) Jforrester: [C: -2] "This broke last time." [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (owner: Jforrester)
[04:13:55] (PS2) Jforrester: Retry "CommonSettings: Factor out variant config generation into MWConfigCacheGenerator" [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602)
[04:14:49] (CR) jerkins-bot: [V: -1] Retry "CommonSettings: Factor out variant config generation into MWConfigCacheGenerator" [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[04:52:03] (PS6) Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - https://gerrit.wikimedia.org/r/525149 (https://phabricator.wikimedia.org/T228805)
[04:52:05] (PS1) Andrew Bogott: Move cloudvirt1016 and 1017 to Stretch [puppet] - https://gerrit.wikimedia.org/r/525712 (https://phabricator.wikimedia.org/T228692)
[04:55:09] (PS2) Andrew Bogott: Move cloudvirt1016 and 1017 to Stretch [puppet] - https://gerrit.wikimedia.org/r/525712 (https://phabricator.wikimedia.org/T228692)
[04:55:11] (PS7) Andrew Bogott: puppet: add facter.conf and cache some facts [puppet] - https://gerrit.wikimedia.org/r/525149 (https://phabricator.wikimedia.org/T228805)
[04:56:26] (CR) Andrew Bogott: [C: +2] Move cloudvirt1016 and 1017 to Stretch [puppet] - https://gerrit.wikimedia.org/r/525712 (https://phabricator.wikimedia.org/T228692) (owner: Andrew Bogott)
[04:57:24] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525557 (owner: Marostegui)
[04:58:17] (Merged) jenkins-bot: db-eqiad.php: Depool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525557 (owner: Marostegui)
[04:58:32] (CR) jenkins-bot: db-eqiad.php: Depool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525557 (owner: Marostegui)
[05:00:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1096 (duration: 00m 49s)
[05:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:16] !log Stop MySQL on db1096 for upgrade
[05:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:41] (PS1) Marostegui: maintain-views: Remove afl_log_id [puppet] - https://gerrit.wikimedia.org/r/525713 (https://phabricator.wikimedia.org/T226851)
[05:13:40] Operations, ops-codfw, DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (Marostegui) p:Normal→High MySQL crashed again: ` [Thu Jul 25 16:10:27 2019] mce: Uncorrected hardware memory error in user-access at 336c902080 [Thu Jul 25 16:10:27 2019] {1}Hardware error d...
[05:26:30] Operations, DBA, Phabricator, User-notice: Switchover m3 (phabricator) master db1072 to db1128 - https://phabricator.wikimedia.org/T228243 (Marostegui)
[05:29:17] (PS1) Marostegui: db-eqiad.php: Slowly repool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525715
[05:34:34] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525715 (owner: Marostegui)
[05:35:24] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525715 (owner: Marostegui)
[05:36:22] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525715 (owner: Marostegui)
[05:36:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1096 (duration: 00m 48s)
[05:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:26] Operations, DBA, decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (Marostegui)
[05:40:43] !log Stop MySQL on db1072 to get it ready for decommission - T228956
[05:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:50] T228956: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956
[05:42:16] (PS1) Marostegui: db-eqiad.php: More traffic to db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525716
[05:43:14] (PS1) Andrew Bogott: Update 'eth1' names for cloudvirt1016 and 1017 [puppet] - https://gerrit.wikimedia.org/r/525717
[05:43:14] Operations, DBA, decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (Marostegui)
[05:43:22] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525716 (owner: Marostegui)
[05:43:53] (CR) Andrew Bogott: [C: +2] Update 'eth1' names for cloudvirt1016 and 1017 [puppet] - https://gerrit.wikimedia.org/r/525717 (owner: Andrew Bogott)
[05:44:16] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525716 (owner: Marostegui)
[05:45:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1096 (duration: 00m 46s)
[05:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:19] (CR) jenkins-bot: db-eqiad.php: More traffic to db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525716 (owner: Marostegui)
[05:48:25] (PS5) Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815)
[05:49:21] (CR) jerkins-bot: [V: -1] mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: Giuseppe Lavagetto)
[05:55:29] (PS1) Marostegui: mariadb: Provision db2123 into s5 codfw [puppet] - https://gerrit.wikimedia.org/r/525720 (https://phabricator.wikimedia.org/T228969)
[05:56:39] (PS1) Marostegui: db-eqiad.php: More traffic to db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525721
[05:57:46] (PS2) Marostegui: mariadb: Provision db2123 into s5 codfw [puppet] - https://gerrit.wikimedia.org/r/525720 (https://phabricator.wikimedia.org/T228969)
[05:59:03] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525721 (owner: Marostegui)
[05:59:26] (CR) Marostegui: [C: +2] mariadb: Provision db2123 into s5 codfw [puppet] - https://gerrit.wikimedia.org/r/525720 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:59:57] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525721 (owner: Marostegui)
[06:00:16] (CR) jenkins-bot: db-eqiad.php: More traffic to db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525721 (owner: Marostegui)
[06:01:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1096 (duration: 00m 47s)
[06:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:24] (PS6) Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815)
[06:04:22] (CR) jerkins-bot: [V: -1] mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: Giuseppe Lavagetto)
[06:04:26] (PS1) Marostegui: site.pp: Remove db2123 from spare [puppet] - https://gerrit.wikimedia.org/r/525722
[06:05:20] (CR) Marostegui: [C: +2] site.pp: Remove db2123 from spare [puppet] - https://gerrit.wikimedia.org/r/525722 (owner: Marostegui)
[06:06:08] (PS7) Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815)
[06:09:06] (PS1) Marostegui: db-eqiad.php: Fully repool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525723
[06:10:47] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525723 (owner: Marostegui)
[06:11:42] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525723 (owner: Marostegui)
[06:11:57] (CR) jenkins-bot: db-eqiad.php: Fully repool db1096 [mediawiki-config] - https://gerrit.wikimedia.org/r/525723 (owner: Marostegui)
[06:12:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1096 (duration: 00m 47s)
[06:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:41] Operations, netops: AS36351 BGP session down on cr2-eqiad - https://phabricator.wikimedia.org/T229085 (elukey) p:Triage→Normal
[06:30:44] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:32:10] PROBLEM - puppet last run on restbase1023 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:32:28] PROBLEM - puppet last run on puppetdb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:58:28] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:59:58] RECOVERY - puppet last run on restbase1023 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:00:16] RECOVERY - puppet last run on puppetdb2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:01:38] (CR) Elukey: "LGTM, seems clear enough.. There is another possible combination with SetEnvIf and RewriteRule but probably not worth it, the RewriteCond " [puppet] - https://gerrit.wikimedia.org/r/525516 (owner: Jbond)
[07:29:42] (CR) Filippo Giunchedi: [C: +1] mediawiki::webserver: add mtail to gather latency, error rate metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: Giuseppe Lavagetto)
[07:39:39] (PS2) Filippo Giunchedi: prometheus: aggregate puppet failure percent by cluster [puppet] - https://gerrit.wikimedia.org/r/525502 (https://phabricator.wikimedia.org/T228878)
[07:39:41] (PS2) Filippo Giunchedi: monitoring: allow logstash in dashboard_links [puppet] - https://gerrit.wikimedia.org/r/525539
[07:39:43] (PS3) Filippo Giunchedi: monitoring: add logstash 5xx dashboard to availability alerts [puppet] - https://gerrit.wikimedia.org/r/525511 (https://phabricator.wikimedia.org/T228878)
[07:39:45] (PS3) Filippo Giunchedi: prometheus: calculate nginx/varnish availability over 2m too [puppet] - https://gerrit.wikimedia.org/r/525512 (https://phabricator.wikimedia.org/T228878)
[07:48:14] Operations, Puppet, Continuous-Integration-Config: puppet.git rake fails with ruby 2.5 - https://phabricator.wikimedia.org/T208566 (fgiunchedi) Open→Resolved a:fgiunchedi The original issue is gone (local tests on Buster), being bold and resolving the task for now but feel free to reopen,...
[07:57:57] (PS3) Filippo Giunchedi: monitoring: allow logstash in dashboard_links [puppet] - https://gerrit.wikimedia.org/r/525539
[08:04:39] (PS2) Jbond: puppetmaster: use mod rewrite for conditional proxypass [puppet] - https://gerrit.wikimedia.org/r/525516
[08:05:35] (PS1) Elukey: Rename sre.hadoop.rolling-restart-workers.py [cookbooks] - https://gerrit.wikimedia.org/r/525771
[08:05:39] (CR) jerkins-bot: [V: -1] puppetmaster: use mod rewrite for conditional proxypass [puppet] - https://gerrit.wikimedia.org/r/525516 (owner: Jbond)
[08:06:06] PROBLEM - puppet last run on cp5003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[unified-new-ec-prime256v1-create-ocsp],Exec[unified-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:07:34] PROBLEM - puppet last run on cp5006 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[unified-new-ec-prime256v1-create-ocsp],Exec[unified-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:07:34] PROBLEM - puppet last run on cp5012 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[unified-new-ec-prime256v1-create-ocsp],Exec[unified-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:09:19] (PS3) Jbond: puppetmaster: use mod rewrite for conditional proxypass [puppet] - https://gerrit.wikimedia.org/r/525516
[08:11:20] (CR) Filippo Giunchedi: [C: +2] monitoring: allow logstash in dashboard_links [puppet] - https://gerrit.wikimedia.org/r/525539 (owner: Filippo Giunchedi)
[08:11:39] (CR) Filippo Giunchedi: [C: +2] monitoring: add logstash 5xx dashboard to availability alerts [puppet] - https://gerrit.wikimedia.org/r/525511 (https://phabricator.wikimedia.org/T228878) (owner: Filippo Giunchedi)
[08:11:47] (PS4) Filippo Giunchedi: monitoring: add logstash 5xx dashboard to availability alerts [puppet] - https://gerrit.wikimedia.org/r/525511 (https://phabricator.wikimedia.org/T228878)
[08:14:33] (CR) Elukey: [C: +2] Rename sre.hadoop.rolling-restart-workers.py [cookbooks] - https://gerrit.wikimedia.org/r/525771 (owner: Elukey)
[08:14:51] (CR) Alaa Sarhan: varnish: Do not strip the cache out of Special:EntityData if revision is set (1 comment) [puppet] - https://gerrit.wikimedia.org/r/525142 (https://phabricator.wikimedia.org/T85499) (owner: Ladsgroup)
[08:18:39] PROBLEM - puppet last run on cp5007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[unified-new-ec-prime256v1-create-ocsp],Exec[unified-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:18:51] (PS3) Filippo Giunchedi: prometheus: aggregate puppet failure percent by cluster [puppet] - https://gerrit.wikimedia.org/r/525502 (https://phabricator.wikimedia.org/T228878)
[08:19:33] PROBLEM - puppet last run on cp5004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[unified-new-ec-prime256v1-create-ocsp],Exec[unified-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:05] (CR) Filippo Giunchedi: [C: +2] prometheus: aggregate puppet failure percent by cluster [puppet] - https://gerrit.wikimedia.org/r/525502 (https://phabricator.wikimedia.org/T228878) (owner: Filippo Giunchedi)
[08:20:13] PROBLEM - puppet last run on cp5001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[unified-new-ec-prime256v1-create-ocsp],Exec[unified-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:51] (PS4) Filippo Giunchedi: prometheus: calculate nginx/varnish availability over 2m too [puppet] - https://gerrit.wikimedia.org/r/525512 (https://phabricator.wikimedia.org/T228878)
[08:21:17] looks like puppet on cp-eqsin is suffering
[08:21:48] (CR) Filippo Giunchedi: [C: +2] prometheus: calculate nginx/varnish availability over 2m too [puppet] - https://gerrit.wikimedia.org/r/525512 (https://phabricator.wikimedia.org/T228878) (owner: Filippo Giunchedi)
[08:22:49] PROBLEM - puppet last run on cp5009 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[unified-new-ec-prime256v1-create-ocsp],Exec[unified-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:24:29] PROBLEM - DNS cloudvirt1017.mgmt on cloudvirt1017.mgmt is CRITICAL: DNS CRITICAL - expected 10.65.5.67 but got 10.65.5.68 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:26:05] PROBLEM - puppet last run on cp5005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[unified-new-ec-prime256v1-create-ocsp],Exec[unified-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:26:50] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/524934 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[08:28:47] (CR) Filippo Giunchedi: "LGTM, modulo inline comments, nice work!" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/525659 (owner: Ayounsi)
[08:29:17] (PS1) Elukey: Add more cumin aliases for Druid clusters [puppet] - https://gerrit.wikimedia.org/r/525775 (https://phabricator.wikimedia.org/T229003)
[08:30:07] PROBLEM - puppet last run on cp5010 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[unified-new-ec-prime256v1-create-ocsp],Exec[unified-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:30:20] I'm taking a look at the puppet failures btw
[08:30:55] PROBLEM - puppet last run on cp5002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[unified-new-ec-prime256v1-create-ocsp],Exec[unified-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:30:55] PROBLEM - puppet last run on cp5008 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[unified-new-ec-prime256v1-create-ocsp],Exec[unified-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:33:24] (took it to -traffic)
[08:34:23] PROBLEM - DNS cloudvirt1016.mgmt on cloudvirt1016.mgmt is CRITICAL: DNS CRITICAL - expected 10.65.5.68 but got 10.65.5.67 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:34:29] (CR) Ema: [C: +1] "YES!" [puppet] - https://gerrit.wikimedia.org/r/525259 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi)
[08:34:41] PROBLEM - puppet last run on cp5011 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[unified-new-ec-prime256v1-create-ocsp],Exec[unified-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:41:07] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/17628/" [puppet] - https://gerrit.wikimedia.org/r/525259 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi)
[08:42:38] !log Add db2123 to tendril and zarcillo - T228969
[08:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:47] T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[08:43:03] Operations, ops-eqiad, DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (fgiunchedi) Not sure if known or expected already, but phase checks for new PDUs A3/A4/A5/A7 show up in icinga as UNKNOWN with `External command error: Error...
[08:43:32] (CR) Elukey: [C: +2] Add more cumin aliases for Druid clusters [puppet] - https://gerrit.wikimedia.org/r/525775 (https://phabricator.wikimedia.org/T229003) (owner: Elukey)
[08:45:27] (PS1) Elukey: Add sre.druid.roll-restart-workers.py [cookbooks] - https://gerrit.wikimedia.org/r/525776 (https://phabricator.wikimedia.org/T229003)
[08:48:16] (PS2) Elukey: Add sre.druid.roll-restart-workers.py [cookbooks] - https://gerrit.wikimedia.org/r/525776 (https://phabricator.wikimedia.org/T229003)
[08:53:51] (PS1) Marostegui: db-eqiad,db-codfw.php: Provision db2123 into s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/525777 (https://phabricator.wikimedia.org/T228969)
[09:10:54] (PS1) Ema: cp-eqsin: do not deploy acme-chief unified certs [puppet] - https://gerrit.wikimedia.org/r/525778 (https://phabricator.wikimedia.org/T229091)
[09:17:25] (CR) Vgutierrez: [C: +1] cp-eqsin: do not deploy acme-chief unified certs [puppet] - https://gerrit.wikimedia.org/r/525778 (https://phabricator.wikimedia.org/T229091) (owner: Ema)
[09:17:36] Operations, serviceops, Patch-For-Review, Release-Engineering-Team-TODO (201907), Wikimedia-Incident: docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (fsero) @greg thanks for following this, i definitely would...
[09:18:08] (CR) Ema: [C: +2] cp-eqsin: do not deploy acme-chief unified certs [puppet] - https://gerrit.wikimedia.org/r/525778 (https://phabricator.wikimedia.org/T229091) (owner: Ema)
[09:22:17] Operations, Traffic, Patch-For-Review: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (Vgutierrez) This happens at the same time that the unified cert is being renewed: ` Jul 26 08:00:02 acmechief1001 acme-chief-backend[8198]: Number of certific...
[09:24:15] RECOVERY - puppet last run on cp5006 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:24:15] RECOVERY - puppet last run on cp5012 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:24:55] RECOVERY - puppet last run on cp5007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:25:07] RECOVERY - puppet last run on cp5011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:25:49] RECOVERY - puppet last run on cp5004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:25:53] RECOVERY - puppet last run on cp5010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:26:25] RECOVERY - puppet last run on cp5001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:26:37] RECOVERY - puppet last run on cp5008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:26:37] RECOVERY - puppet last run on cp5002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:27:09] RECOVERY - puppet last run on cp5005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:28:01] (CR) ArielGlenn: [C: +1] db-eqiad,db-codfw.php: Provision db2123 into s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/525777 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[09:28:19] RECOVERY - puppet last run on cp5003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:28:42] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Provision db2123 into s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/525777 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[09:29:01] RECOVERY - puppet last run on cp5009 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:29:38] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Provision db2123 into s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/525777 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[09:29:53] (CR) jenkins-bot: db-eqiad,db-codfw.php: Provision db2123 into s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/525777 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[09:31:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Provision db2123 into s5 T228969 (duration: 00m 48s)
[09:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:38] T228969: Productionize db21[21-30} - https://phabricator.wikimedia.org/T228969
[09:32:23] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Provision db2123 into s5 T228969 (duration: 00m 47s)
[09:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:01] (CR) Elukey: [C: -1] Add sre.druid.roll-restart-workers.py (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/525776 (https://phabricator.wikimedia.org/T229003) (owner: Elukey)
[09:46:05] (PS1) Marostegui: install_server: Do not reimage dbproxy10[20-21] [puppet] - https://gerrit.wikimedia.org/r/525783
[09:46:50] (CR) Marostegui: [C: +2] install_server: Do not reimage dbproxy10[20-21] [puppet] - https://gerrit.wikimedia.org/r/525783 (owner: Marostegui)
[09:46:59] Operations, Traffic, Patch-For-Review: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (Vgutierrez) update-ocsp is configured to use the certificate only version to perform the OCSP stapling: `` vgutierrez@cp5001:/etc/update-ocsp.d$ cat unified-n...
[09:50:43] Operations, Traffic, Patch-For-Review: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (Vgutierrez) p:Normal→High This is a big issue, cause right now due to the invalid state of update-ocsp/acme-chief, nginx cannot be restarted in the cp...
[10:07:49] (PS1) Vgutierrez: Revert "cp-eqsin: do not deploy acme-chief unified certs" [puppet] - https://gerrit.wikimedia.org/r/525787
[10:08:47] (CR) jerkins-bot: [V: -1] Revert "cp-eqsin: do not deploy acme-chief unified certs" [puppet] - https://gerrit.wikimedia.org/r/525787 (owner: Vgutierrez)
[10:09:10] of course.. the revert explanation doesn't fit in one line according to our commit linter
[10:09:42] (PS2) Vgutierrez: Revert "cp-eqsin: do not deploy acme-chief unified certs" [puppet] - https://gerrit.wikimedia.org/r/525787
[10:12:25] (PS3) Elukey: Add sre.druid.roll-restart-workers.py [cookbooks] - https://gerrit.wikimedia.org/r/525776 (https://phabricator.wikimedia.org/T229003)
[10:27:03] (CR) Ema: [C: +1] Revert "cp-eqsin: do not deploy acme-chief unified certs" [puppet] - https://gerrit.wikimedia.org/r/525787 (owner: Vgutierrez)
[10:27:21] (CR) Vgutierrez: [C: +2] Revert "cp-eqsin: do not deploy acme-chief unified certs" [puppet] - https://gerrit.wikimedia.org/r/525787 (owner: Vgutierrez)
[10:27:35] (PS3) Vgutierrez: Revert "cp-eqsin: do not deploy acme-chief unified certs" [puppet] - https://gerrit.wikimedia.org/r/525787
[10:28:08] Operations, serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (jijiki)
[10:38:03] Operations, Traffic: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (Vgutierrez) p:High→Normal So, I've manually generated the missing versions on acmechief1001: ` >>> cert = Certificate.load('/var/lib/acme-chief/certs/unified/new/rsa-2048.c...
[10:40:00] 10Operations, 10Acme-chief, 10Traffic: Provide the three cert types (chain-only, cert only and chained) as soon as we get the certificate issued - https://phabricator.wikimedia.org/T229096 (10Vgutierrez) [10:42:04] 10Operations, 10Traffic: Provide ensure => absent support for acme_chief::cert define - https://phabricator.wikimedia.org/T229097 (10Vgutierrez) [10:43:27] 10Operations, 10Acme-chief, 10Traffic: Provide ensure => absent support for acme_chief::cert define - https://phabricator.wikimedia.org/T229097 (10Vgutierrez) p:05Triage→03Normal [10:48:09] (03PS1) 10Fsero: k8s: changing CI limits so actual charts can be tested [deployment-charts] - 10https://gerrit.wikimedia.org/r/525789 (https://phabricator.wikimedia.org/T229073) [10:48:30] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: changing CI limits so actual charts can be tested [deployment-charts] - 10https://gerrit.wikimedia.org/r/525789 (https://phabricator.wikimedia.org/T229073) (owner: 10Fsero) [10:48:37] (03PS2) 10Fsero: k8s: changing CI limits so actual charts can be tested [deployment-charts] - 10https://gerrit.wikimedia.org/r/525789 (https://phabricator.wikimedia.org/T229073) [10:48:41] (03CR) 10Fsero: [V: 03+2 C: 03+2] k8s: changing CI limits so actual charts can be tested [deployment-charts] - 10https://gerrit.wikimedia.org/r/525789 (https://phabricator.wikimedia.org/T229073) (owner: 10Fsero) [11:22:25] 10Operations, 10DC-Ops: Phase monitoring for new PDUs - https://phabricator.wikimedia.org/T229101 (10fgiunchedi) [11:23:08] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10fgiunchedi) >>! In T226778#5368034, @fgiunchedi wrote: > Not sure if known or expected already, but phase checks for new PDUs A3/A4/A5/A7 show up in icinga a... 
[11:35:15] 10Operations, 10DC-Ops: Phase monitoring for new PDUs - https://phabricator.wikimedia.org/T229101 (10faidon) > whereas ulsfo PDUs installed in T209101 are currently missing icinga phase monitoring checks (i.e. only ping checks) Note that ulsfo does not have 3-phase power so it makes sense here to be different... [12:29:30] (03PS1) 10Marostegui: db-codfw.php: Depool db2038, pool db2113 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525799 (https://phabricator.wikimedia.org/T221533) [12:31:26] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Depool db2038, pool db2113 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525799 (https://phabricator.wikimedia.org/T221533) (owner: 10Marostegui) [12:32:24] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2038, pool db2113 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525799 (https://phabricator.wikimedia.org/T221533) (owner: 10Marostegui) [12:34:32] 10Operations, 10ops-codfw, 10serviceops: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (10jijiki) @Papaul Can you let us know what our options are (if any)?
[12:35:44] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Provision db2123 into s5 vslow T221533 (duration: 00m 50s) [12:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:53] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 [12:36:23] (03CR) 10jenkins-bot: db-codfw.php: Depool db2038, pool db2113 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525799 (https://phabricator.wikimedia.org/T221533) (owner: 10Marostegui) [12:45:41] !log rebooting labsdb1012.eqiad.wmnet for updates T224228 [12:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:26] !log Change user email assigned to SUL user Stansfield (T229004) [13:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:33] T229004: User:Stansfield has forgotten their password, need a reset via CLI - https://phabricator.wikimedia.org/T229004 [13:07:29] !log rebooting labstore1006.wikimedia.org for updates T224228 [13:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:19] (03PS1) 10Elukey: Add cumin alias for Kafka logging eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/525802 [13:14:34] (03CR) 10Filippo Giunchedi: [C: 03+1] Add cumin alias for Kafka logging eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/525802 (owner: 10Elukey) [13:16:37] (03CR) 10Elukey: [C: 03+2] Add cumin alias for Kafka logging eqiad/codfw [puppet] - 10https://gerrit.wikimedia.org/r/525802 (owner: 10Elukey) [13:17:09] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10RobH) Please note that replacement disks have now been ordered on T228302 and should arrive sometime next week. The 3 day shipping option was selected, so we currently expect this to ship on Friday/Monday and arr... 
[13:20:23] (03CR) 10Eevans: table-properties: Initial commit (031 comment) [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [13:33:39] (03PS1) 10Elukey: Add sre.kafka.roll-restart-brokers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/525804 (https://phabricator.wikimedia.org/T229003) [13:36:20] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [13:36:30] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [13:39:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Traffic, 10decommission: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (10RobH) a:05RobH→03None [13:41:40] (03PS1) 10RobH: decom lvs100[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/525805 (https://phabricator.wikimedia.org/T224223) [13:41:43] !log updated labstore100[67].wikimedia.org performance scaling_governor T225713 [13:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:51] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [13:41:59] !log sudo -i reprepro --ignore=wrongdistribution include stretch-wikimedia /home/fsero/envoyproxy_1.11.0~wmf1_amd64.changes [13:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:14] (03CR) 10RobH: [C: 03+2] decom lvs100[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/525805 (https://phabricator.wikimedia.org/T224223) (owner: 10RobH) [13:42:33] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10JHedden) [13:43:43] (03PS1) 10Fsero: envoy: bump docker image to 1.11 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/525806 [13:44:05] (03CR) 10Fsero: [V: 03+2 C: 
03+2] envoy: bump docker image to 1.11 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/525806 (owner: 10Fsero) [13:44:28] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10Marostegui) >>! In T225713#5335975, @jcrespo wrote: > FYI, after applying the above change, I expected a huge shift on reported load (even if performance didn't change) or on temperatures, given this (wik... [13:52:57] (03PS1) 10Jhedden: Revert "dumps dist: switch active web to labstore1007" [puppet] - 10https://gerrit.wikimedia.org/r/525812 [13:54:51] (03PS2) 10Jhedden: Revert "dumps dist: switch active web to labstore1007" [puppet] - 10https://gerrit.wikimedia.org/r/525812 [13:56:00] (03CR) 10Jhedden: [C: 03+2] Revert "dumps dist: switch active web to labstore1007" [puppet] - 10https://gerrit.wikimedia.org/r/525812 (owner: 10Jhedden) [14:04:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10Traffic, 10decommission: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (10RobH) [14:05:00] (03PS4) 10Ema: ATS: Vary-slotting for PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) [14:05:03] (03PS2) 10Ema: ATS: Vary-slotting for X-Forwarded-Proto [puppet] - 10https://gerrit.wikimedia.org/r/525310 (https://phabricator.wikimedia.org/T227432) [14:05:05] (03PS1) 10Ema: ATS: add-vary Lua plugin [puppet] - 10https://gerrit.wikimedia.org/r/525815 (https://phabricator.wikimedia.org/T227432) [14:05:30] (03CR) 10Elukey: "Going to deploy this change on Monday Aaron!" [puppet] - 10https://gerrit.wikimedia.org/r/525224 (https://phabricator.wikimedia.org/T225642) (owner: 10Elukey) [14:06:27] PROBLEM - puppet last run on cp4028 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[wikibase-new-ec-prime256v1-create-ocsp],Exec[wikibase-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:08:39] PROBLEM - puppet last run on cp1089 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[wikibase-new-ec-prime256v1-create-ocsp],Exec[wikibase-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:08:58] (03CR) 10Fsero: [C: 04-1] helmfile: Update README to mention ".hfenv" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/525468 (owner: 10Thcipriani) [14:09:45] PROBLEM - puppet last run on cp1075 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[wikibase-new-ec-prime256v1-create-ocsp],Exec[wikibase-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:09:49] PROBLEM - puppet last run on cp4031 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[wikibase-new-ec-prime256v1-create-ocsp],Exec[wikibase-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:09:50] mmm [14:09:54] not good? [14:10:34] the cache hosts errors look like what we've seen this morning, acme-chief sadness [14:10:39] PROBLEM - puppet last run on cp5012 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[wikibase-new-ec-prime256v1-create-ocsp],Exec[wikibase-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:10:50] !log disable puppet on cache nodes T229091 [14:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:57] T229091: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 [14:11:01] PROBLEM - puppet last run on cp1077 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[wikibase-new-ec-prime256v1-create-ocsp],Exec[wikibase-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:11:12] same for wikibase damn [14:11:41] let me take care of that [14:12:31] vgutierrez: ty! [14:12:46] my pleasure :D [14:13:14] we are breaking servers because we miss you :) [14:13:52] ❤️ [14:15:06] (03PS1) 10Andrew Bogott: Swap mac addresses for cloudvirt1016 and 1017 [puppet] - 10https://gerrit.wikimedia.org/r/525817 (https://phabricator.wikimedia.org/T228691) [14:16:14] (03CR) 10Andrew Bogott: [C: 03+2] Swap mac addresses for cloudvirt1016 and 1017 [puppet] - 10https://gerrit.wikimedia.org/r/525817 (https://phabricator.wikimedia.org/T228691) (owner: 10Andrew Bogott) [14:17:00] (03CR) 10Bstorm: "I see the field is usually NULL anyway. I don't know if that's always true? We could set the field to "NULL as" first (unblocking the dr" [puppet] - 10https://gerrit.wikimedia.org/r/525713 (https://phabricator.wikimedia.org/T226851) (owner: 10Marostegui) [14:17:40] ema: done, you can reenable puppet now [14:19:15] (03CR) 10Marostegui: "> I see the field is usually NULL anyway. 
I don't know if that's" [puppet] - 10https://gerrit.wikimedia.org/r/525713 (https://phabricator.wikimedia.org/T226851) (owner: 10Marostegui) [14:19:16] 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1016 with 10G interfaces - https://phabricator.wikimedia.org/T228692 (10Andrew) correction: the MAC address on this host is F4:E9:D4:BA:B7:10 [14:19:26] 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1017 with 10G interfaces - https://phabricator.wikimedia.org/T228691 (10Andrew) correction: the MAC address for this host is F4:E9:D4:BA:B7:40 [14:20:13] vgutierrez: ack, trying on cp1077 [14:22:10] cool [14:22:15] RECOVERY - puppet last run on cp1077 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:22:19] \o/ [14:23:00] vgutierrez: all good. Please document whatever you've done somewhere so that we can take care of it without interrupting your vacation, should it happen again [14:23:20] !log re-enable puppet on cache nodes T229091 [14:23:21] I've did it before in the phab task ;) [14:23:24] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/525713 (https://phabricator.wikimedia.org/T226851) (owner: 10Marostegui) [14:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:27] T229091: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 [14:23:30] s/did/done/ [14:24:06] (03PS2) 10Bstorm: maintain-views: Remove afl_log_id [puppet] - 10https://gerrit.wikimedia.org/r/525713 (https://phabricator.wikimedia.org/T226851) (owner: 10Marostegui) [14:24:23] 10Operations, 10Traffic: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (10Vgutierrez) As it's been done with unified, wikibase required the same patch: ` >>> 
from acme_chief.x509 import Certificate >>> from acme_chief.acme_chief import CERTIFICATE_TYPES... [14:24:31] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[wikibase-new-ec-prime256v1-create-ocsp],Exec[wikibase-new-rsa-2048-create-ocsp] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:24:51] ema: https://phabricator.wikimedia.org/T229091#5368202 and now https://phabricator.wikimedia.org/T229091#5368765 [14:25:29] RECOVERY - puppet last run on cp1089 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:25:55] but now that's been fixed, next occurrence of this bug should happen in 2 months and I plan to get a version of acme-chief fixing the issue way sooner [14:26:13] unified & wikibase are the only ones configured with OCSP stapling AFAIK [14:26:33] RECOVERY - puppet last run on cp1075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:26:37] RECOVERY - puppet last run on cp4031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:26:57] vgutierrez: ack, thanks [14:27:03] (03CR) 10Marostegui: "> > Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/525713 (https://phabricator.wikimedia.org/T226851) (owner: 10Marostegui) [14:27:09] my pleasure [14:27:29] RECOVERY - puppet last run on cp5012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:28:53] RECOVERY - puppet last run on cp4028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun 
[14:29:07] (03CR) 10Anomie: "I have no opinion about the patch itself." [puppet] - 10https://gerrit.wikimedia.org/r/525142 (https://phabricator.wikimedia.org/T85499) (owner: 10Ladsgroup) [14:30:05] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:31:22] (03CR) 10Bstorm: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/525713 (https://phabricator.wikimedia.org/T226851) (owner: 10Marostegui) [14:33:37] (03PS3) 10Filippo Giunchedi: Consolidate 'critical' and 'contact groups' logic [puppet] - 10https://gerrit.wikimedia.org/r/525535 (https://phabricator.wikimedia.org/T228878) [14:33:45] (03PS3) 10Filippo Giunchedi: monitoring: tweak description for paging alerts [puppet] - 10https://gerrit.wikimedia.org/r/525536 (https://phabricator.wikimedia.org/T228878) [14:34:34] 10Operations, 10serviceops, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: create swift container-to-container synchronization metrics - https://phabricator.wikimedia.org/T229117 (10fsero) [14:36:37] 10Operations, 10serviceops, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: create a docker_registry_codfw swift container backup - https://phabricator.wikimedia.org/T229118 (10fsero) [14:37:18] (03PS2) 10Ema: ATS: add-vary Lua plugin [puppet] - 10https://gerrit.wikimedia.org/r/525815 (https://phabricator.wikimedia.org/T227432) [14:37:20] (03PS5) 10Ema: ATS: Vary-slotting for PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) [14:37:22] (03PS3) 10Ema: ATS: Vary-slotting for X-Forwarded-Proto [puppet] - 10https://gerrit.wikimedia.org/r/525310 (https://phabricator.wikimedia.org/T227432) [14:38:45] (03CR) 10Filippo Giunchedi: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/525536 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [14:39:32] (03CR) 10Filippo Giunchedi: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/525535 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [14:40:34] (03CR) 10Marostegui: "> > Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/525713 (https://phabricator.wikimedia.org/T226851) (owner: 10Marostegui) [14:43:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Traffic, 10decommission: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (10RobH) [14:43:33] (03CR) 10Bstorm: [C: 03+1] "Sounds good!" [puppet] - 10https://gerrit.wikimedia.org/r/525713 (https://phabricator.wikimedia.org/T226851) (owner: 10Marostegui) [14:45:06] (03PS1) 10RobH: decom mgmt dns for lvs100[1-6] [dns] - 10https://gerrit.wikimedia.org/r/525822 (https://phabricator.wikimedia.org/T224223) [14:45:41] (03CR) 10Krinkle: [C: 04-1] Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz) [14:45:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Traffic, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (10RobH) a:03ayounsi @ayounsi, Per your request, we are assigning this to you for the switch configuration removal for lvs100[1-6]. All of the systems have... [14:58:01] (03PS2) 10Krinkle: Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521004 (owner: 10Aaron Schulz) [14:58:49] (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: add el refine job [puppet] - 10https://gerrit.wikimedia.org/r/525824 (https://phabricator.wikimedia.org/T226698) [15:00:12] (03CR) 10Herron: [C: 03+1] "LGTM let's try it! 
One minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/525536 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [15:00:59] (03PS3) 10Krinkle: Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521004 (owner: 10Aaron Schulz) [15:01:03] (03CR) 10Krinkle: [C: 03+1] Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521004 (owner: 10Aaron Schulz) [15:01:10] (03CR) 10Krinkle: [C: 03+2] Add entries to wgCSPFalsePositiveUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504474 (https://phabricator.wikimedia.org/T207900) (owner: 10Krinkle) [15:02:09] !log krinkle@deploy1001: php-1.34.0-wmf.15 is still dirty on extensions/CheckUser [15:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:47] (03PS6) 10Krinkle: Add entries to wgCSPFalsePositiveUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504474 (https://phabricator.wikimedia.org/T207900) [15:07:55] (03CR) 10Krinkle: [C: 03+2] Add entries to wgCSPFalsePositiveUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504474 (https://phabricator.wikimedia.org/T207900) (owner: 10Krinkle) [15:09:42] (03Merged) 10jenkins-bot: Add entries to wgCSPFalsePositiveUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504474 (https://phabricator.wikimedia.org/T207900) (owner: 10Krinkle) [15:10:03] (03CR) 10jenkins-bot: Add entries to wgCSPFalsePositiveUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504474 (https://phabricator.wikimedia.org/T207900) (owner: 10Krinkle) [15:12:27] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: de0822497919b1b (duration: 00m 48s) [15:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:08] (03PS3) 10Ema: ATS: add-vary Lua plugin [puppet] - 
10https://gerrit.wikimedia.org/r/525815 (https://phabricator.wikimedia.org/T227432) [15:15:10] (03PS6) 10Ema: ATS: Vary-slotting for PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/525274 (https://phabricator.wikimedia.org/T206339) [15:15:12] (03PS4) 10Ema: ATS: Vary-slotting for X-Forwarded-Proto [puppet] - 10https://gerrit.wikimedia.org/r/525310 (https://phabricator.wikimedia.org/T227432) [15:15:42] 10Operations, 10LDAP-Access-Requests, 10Security-Team, 10Trust-and-Safety: Add sguebo_WMF to WMF LDAP group - https://phabricator.wikimedia.org/T228927 (10herron) 05Open→03Resolved a:03herron @sguebo_WMF (LDAP username sguebo) has been added to the wmf ldap group [15:16:49] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10herron) a:03Nuria [15:19:04] 10Operations, 10SRE-Access-Requests: add jclark to datacenter-ops group - https://phabricator.wikimedia.org/T229124 (10RobH) [15:19:22] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:19:40] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:22:15] 10Operations, 10SRE-Access-Requests: add jclark to datacenter-ops group - https://phabricator.wikimedia.org/T229124 (10RobH) p:05Triage→03Normal [15:23:33] 10Operations, 10SRE-Access-Requests: add jclark to datacenter-ops group - https://phabricator.wikimedia.org/T229124 (10RobH) a:03Jclark-ctr @Jclark-ctr: We will need you to do the following to get shell access: * User has signed the L3 
Acknowledgement of Wikimedia Server Access Responsibilities Document. *... [15:26:22] (03CR) 10CDanis: [C: 04-2] "This won't be merged until next week, but I would appreciate comments on the code." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [15:26:24] 10Operations, 10DC-Ops, 10Traffic: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (10RobH) [15:28:15] 10Operations, 10ops-eqiad, 10decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10RobH) a:03MoritzMuehlenhoff @MoritzMuehlenhoff, Have the grants for this system been removed so we can move forward with decommission? Directed this to you since you created this task, and I as... [15:29:17] 10Operations, 10ops-codfw, 10decommission: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10RobH) a:03MoritzMuehlenhoff >>! In T220503#5369016, @RobH wrote: > @MoritzMuehlenhoff, > > Have the grants for this system been removed so we can move forward with decommission? Directed this to y... [15:46:06] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:46:30] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:46:58] (03CR) 10Ayounsi: "Thanks! Fixed."
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/525659 (owner: 10Ayounsi) [15:47:06] (03PS3) 10Ayounsi: Anycast: Add Prometheus exporter to Bird [puppet] - 10https://gerrit.wikimedia.org/r/525659 [15:48:57] 10Operations, 10SRE-Access-Requests: add jclark to datacenter-ops group - https://phabricator.wikimedia.org/T229124 (10Jclark-ctr) a:05Jclark-ctr→03wiki_willy ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC+1duGbT11VE4IV3KKFzdmHhSl2fAA0CkL93edalw2yqroMxzHjah7GwKB5csjrrbqhn+po0478jsU8OG8hgJBRKSq2cG04ryQk8MVSIy6gnqQ... [15:58:07] (03PS2) 10Thcipriani: helmfile: Update README to mention ".hfenv" [deployment-charts] - 10https://gerrit.wikimedia.org/r/525468 [15:58:26] (03CR) 10Thcipriani: helmfile: Update README to mention ".hfenv" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/525468 (owner: 10Thcipriani) [16:05:04] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.15/extensions/Flow/includes/Search/Iterators/TopicIterator.php: T229114 make orderUUID public, as it is needed by other classes for Dumps (duration: 00m 47s) [16:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:12] T229114: flow dumps broken - https://phabricator.wikimedia.org/T229114 [16:17:38] 10Operations, 10SRE-Access-Requests: add jclark to datacenter-ops group - https://phabricator.wikimedia.org/T229124 (10wiki_willy) Approved for the following: [16:18:24] 10Operations, 10SRE-Access-Requests: add jclark to datacenter-ops group - https://phabricator.wikimedia.org/T229124 (10wiki_willy) a:05wiki_willy→03RobH Approved for the following: approval to access systems as your manager approval to access dc-ops group as the dc-ops group manager [16:18:46] (03CR) 10Ottomata: [C: 03+1] role::analytics_test_cluster::coordinator: add el refine job [puppet] - 10https://gerrit.wikimedia.org/r/525824 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [16:25:35] 10Operations, 10serviceops, 10Release-Engineering-Team-TODO (201907), 
10Wikimedia-Incident: create a docker_registry_codfw swift container backup - https://phabricator.wikimedia.org/T229118 (10herron) p:05Triage→03Normal [16:25:47] 10Operations, 10serviceops, 10Release-Engineering-Team-TODO (201907), 10Wikimedia-Incident: create swift container-to-container synchronization metrics - https://phabricator.wikimedia.org/T229117 (10herron) p:05Triage→03Normal [16:26:23] 10Operations, 10DC-Ops, 10observability: Phase monitoring for new PDUs - https://phabricator.wikimedia.org/T229101 (10herron) p:05Triage→03Normal [16:26:55] 10Operations, 10Acme-chief, 10Traffic: Provide the three cert types (chain-only, cert only and chained) as soon as we get the certificate issued - https://phabricator.wikimedia.org/T229096 (10herron) p:05Triage→03Normal [16:29:08] 10Operations, 10Puppet, 10observability: Use git commit id as "configuration version" for puppet - https://phabricator.wikimedia.org/T228854 (10herron) p:05Triage→03Normal [16:34:46] 10Operations, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review, and 3 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10CCicalese_WMF) [16:58:36] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:03:16] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10serviceops-radar, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Krinkle) [17:04:21] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10Patch-For-Review: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124 (10Krinkle) [17:05:10] 10Operations, 10Discovery, 10MediaWiki-Debug-Logger, 10HHVM: MediaWiki monolog doesn't handle Kafka 
failures gracefully - https://phabricator.wikimedia.org/T125084 (10Krinkle) [17:05:30] (03PS1) 10RobH: adding jclark to shell and dc ops group [puppet] - 10https://gerrit.wikimedia.org/r/525847 (https://phabricator.wikimedia.org/T229124) [17:06:21] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: add jclark to datacenter-ops group - https://phabricator.wikimedia.org/T229124 (10RobH) [17:07:03] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: add jclark to datacenter-ops group - https://phabricator.wikimedia.org/T229124 (10RobH) Please note that the patchset is prepared and this is now in the 3 day waiting period. If no objections are noted, this can be merged on Wednesday, 2019-07-31. [17:08:02] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: add jclark to datacenter-ops group - https://phabricator.wikimedia.org/T229124 (10RobH) a:05RobH→03None [17:19:14] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:19:54] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [17:21:52] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [17:35:34] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 
(10CCicalese_WMF) [17:36:36] (03CR) 10Paladox: [C: 03+1] gerrit: use gerrit-deployers not gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/525444 (owner: 10Thcipriani) [17:37:17] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [17:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:22] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [17:37:26] 10Operations, 10ops-eqiad, 10decommission: Return sulfur to spares - https://phabricator.wikimedia.org/T224475 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `sulfur.wikimedia.org` - sulfur.wikimedia.org - Removed from Puppet master and PuppetDB - Downtimed... [17:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:37] 10Operations, 10ops-eqiad, 10decommission: Return sulfur to spares - https://phabricator.wikimedia.org/T224475 (10RobH) [17:37:39] 10Operations, 10ops-eqiad: Degraded RAID on sulfur - https://phabricator.wikimedia.org/T229134 (10ops-monitoring-bot) [17:38:56] (03PS1) 10RobH: return sulfur to spares [dns] - 10https://gerrit.wikimedia.org/r/525852 (https://phabricator.wikimedia.org/T224475) [17:40:03] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. 
- https://phabricator.wikimedia.org/T217359 (CCicalese_WMF)
[17:40:27] (CR) RobH: [C: +2] return sulfur to spares [dns] - https://gerrit.wikimedia.org/r/525852 (https://phabricator.wikimedia.org/T224475) (owner: RobH)
[17:40:29] (PS1) RobH: return sulfur to spares [puppet] - https://gerrit.wikimedia.org/r/525853 (https://phabricator.wikimedia.org/T224475)
[17:40:51] (CR) RobH: [C: +2] return sulfur to spares [puppet] - https://gerrit.wikimedia.org/r/525853 (https://phabricator.wikimedia.org/T224475) (owner: RobH)
[17:43:13] Operations, ops-eqiad, decommission: Return sulfur to spares - https://phabricator.wikimedia.org/T224475 (RobH)
[17:44:16] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[17:44:59] Operations, Cassandra, RESTBase, Core Platform Team (Cassandra Operational ), Services (watching): restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (CCicalese_WMF)
[17:45:30] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[17:48:06] Operations, Traffic, Core Platform Team (Services Operations): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (WDoranWMF)
[17:49:01] Operations, Beta-Cluster-Infrastructure, Mail, MediaWiki-Email, and 2 others: [betacluster] Cannot confirm email address - confirmation never received - https://phabricator.wikimedia.org/T227714 (Etonkovidova) Open→Resolved a:Etonkovidova @herron I use two email addresses when testin...
[17:50:37] (PS1) Ppchelko: [EventBus] Switch resource_change event to eventgate. [mediawiki-config] - https://gerrit.wikimedia.org/r/525854 (https://phabricator.wikimedia.org/T211248)
[17:51:33] (CR) Nuria: [C: +1] "This is some smooth migration!" [mediawiki-config] - https://gerrit.wikimedia.org/r/525854 (https://phabricator.wikimedia.org/T211248) (owner: Ppchelko)
[17:52:41] long time no see :D
[17:53:18] we noticed wikimedia was looking at using envoy and were wondering what you were going to be using it for, and if anyone wanted to chat with us about anything :)
[17:56:11] {{disamb}}
[17:56:53] (envoy proxy, that is)
[17:57:48] first result on google for "envoy software" is "Envoy is a visitor sign software that uses application for iPad to welcome in guests and manage their visit."
[17:58:30] Ryan_Lane: https://phabricator.wikimedia.org/T215810
[17:58:46] redis proxy, seemingly atm
[18:00:31] yeah, that's what I was seeing in puppet too. wanted to see if it was being used for anything else
[18:00:38] fsero might know
[18:00:53] https://www.envoyproxy.io/ <--
[18:01:51] !log remove lvs100[1-6] switch config from asw2-a-eqiad - T224223
[18:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:59] T224223: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223
[18:04:45] (PS1) Ppchelko: Remove RESTBase graphite alerts. [puppet] - https://gerrit.wikimedia.org/r/525856 (https://phabricator.wikimedia.org/T185089)
[18:08:40] !log remove lvs100[1-6] switch config from asw2-b-eqiad - T224223
[18:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:47] T224223: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223
[18:09:59] (CR) Ppchelko: "Filed https://phabricator.wikimedia.org/T229137 to reevaluate RESTBase alerts and possibly recreate them based on grafana."
[puppet] - https://gerrit.wikimedia.org/r/525856 (https://phabricator.wikimedia.org/T185089) (owner: Ppchelko)
[18:12:12] it's a Ryan_Lane :)
[18:12:34] yep! howdy :)
[18:14:37] (Abandoned) Paladox: Gerrit: Wrap