[00:09:06] (PS1) Bstorm: wikireplicas: depool labsdb1009 for view updates [puppet] - https://gerrit.wikimedia.org/r/514199 (https://phabricator.wikimedia.org/T223406)
[00:09:26] !log T223406 repooled labsdb1011 after completing view updates
[00:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:32] T223406: Remove reference to fields replaced by the actor table from WMCS views - https://phabricator.wikimedia.org/T223406
[00:10:19] Operations, Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (bd808) +1 to close
[00:10:21] (CR) Bstorm: [C: +2] wikireplicas: depool labsdb1009 for view updates [puppet] - https://gerrit.wikimedia.org/r/514199 (https://phabricator.wikimedia.org/T223406) (owner: Bstorm)
[00:16:25] Operations, netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (ayounsi) Below is my proposal to add validation to our config. There are many possible ways of doing it, so feedback from @faidon or @mark is welcome! * The RPKI BGP communities are non transitive so `community delete RPKI...
[01:02:12] (PS1) Bstorm: Revert "wikireplicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/514202
[01:06:22] (CR) Bstorm: [C: +2] Revert "wikireplicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/514202 (owner: Bstorm)
[01:10:57] !log T223406 depooled/repooled labsdb1009 for view updates
[01:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:03] T223406: Remove reference to fields replaced by the actor table from WMCS views - https://phabricator.wikimedia.org/T223406
[01:43:45] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[01:46:35] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman
[01:55:43] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 881.75 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:43:09] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[02:59:56] Operations, ops-codfw, Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (herron) a:herron→Papaul Kafka-main200[123], and kafka-main2005 are installed, have had the initial puppet run applied and are now marked "staged" in netbox. Kafka-main2...
[03:08:57] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[04:05:45] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:12:41] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[04:14:05] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.344 second response time https://phabricator.wikimedia.org/T174916
[04:21:15] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[04:26:57] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.338 second response time https://phabricator.wikimedia.org/T174916
[04:31:15] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[04:55:08] (PS1) Marostegui: db-eqiad.php: Remove db1081 from API [mediawiki-config] - https://gerrit.wikimedia.org/r/514208 (https://phabricator.wikimedia.org/T224852)
[04:56:36] (CR) Marostegui: [C: +2] db-eqiad.php: Remove db1081 from API [mediawiki-config] - https://gerrit.wikimedia.org/r/514208 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:57:36] (Merged) jenkins-bot: db-eqiad.php: Remove db1081 from API [mediawiki-config] - https://gerrit.wikimedia.org/r/514208 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[04:59:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1081 from API (duration: 00m 49s)
[04:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:21] (CR) jenkins-bot: db-eqiad.php: Remove db1081 from API [mediawiki-config] - https://gerrit.wikimedia.org/r/514208 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[05:06:11] (PS1) Marostegui: mariadb: Move db2065 to m3 [puppet] - https://gerrit.wikimedia.org/r/514209 (https://phabricator.wikimedia.org/T221533)
[05:09:41] (CR) Marostegui: [C: +2] mariadb: Move db2065 to m3 [puppet] - https://gerrit.wikimedia.org/r/514209 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[05:16:53] (PS1) Marostegui: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/514210 (https://phabricator.wikimedia.org/T224852)
[05:17:51] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/514210 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[05:18:43] (Merged) jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/514210 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[05:18:58] (CR) jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/514210 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[05:19:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097 for upgrade (duration: 00m 47s)
[05:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:59] !log Stop MySQL on db1097 for upgrade
[05:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:48] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - https://gerrit.wikimedia.org/r/514211
[05:25:28] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - https://gerrit.wikimedia.org/r/514211 (owner: Marostegui)
[05:26:18] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - https://gerrit.wikimedia.org/r/514211 (owner: Marostegui)
[05:26:33] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - https://gerrit.wikimedia.org/r/514211 (owner: Marostegui)
[05:27:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097 after upgrade (duration: 00m 46s)
[05:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:39] (PS1) Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514214
[05:38:44] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514214 (owner: Marostegui)
[05:39:35] (Merged) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514214 (owner: Marostegui)
[05:39:52] (CR) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514214 (owner: Marostegui)
[05:40:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 for upgrade (duration: 00m 48s)
[05:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:47] !log Stop MySQL on db1091 for MySQL upgrade T224852
[05:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:52] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[05:45:22] (PS1) Marostegui: db-eqiad.php: Slowly repool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514215
[05:49:03] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514215 (owner: Marostegui)
[05:49:56] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514215 (owner: Marostegui)
[05:50:16] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514215 (owner: Marostegui)
[05:51:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1091 after upgrade (duration: 00m 47s)
[05:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:49] (PS1) Marostegui: db-eqiad.php: More traffic to db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514216
[05:54:00] !log Stop MySQL on db2078:m3 - T221533
[05:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:05] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533
[05:57:49] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.273 second response time https://phabricator.wikimedia.org/T174916
[06:02:23] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514216 (owner: Marostegui)
[06:03:01] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[06:03:25] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514216 (owner: Marostegui)
[06:04:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1091 after upgrade (duration: 00m 47s)
[06:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:57] (CR) jenkins-bot: db-eqiad.php: More traffic to db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514216 (owner: Marostegui)
[06:08:15] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.507 second response time https://phabricator.wikimedia.org/T174916
[06:12:37] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[06:15:23] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.976 second response time https://phabricator.wikimedia.org/T174916
[06:21:05] !log restart pdfrender on scb1002 (flapping)
[06:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:29] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:30:57] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/field.sh]
[06:37:06] (PS1) Marostegui: db-eqiad.php: Fully repool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514218
[06:38:30] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514218 (owner: Marostegui)
[06:39:25] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514218 (owner: Marostegui)
[06:39:55] (CR) jenkins-bot: db-eqiad.php: Fully repool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514218 (owner: Marostegui)
[06:42:34] (PS1) Elukey: role::analytics_cluster::coordinator: review Hive's GC settings [puppet] - https://gerrit.wikimedia.org/r/514219 (https://phabricator.wikimedia.org/T222895)
[06:42:44] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1091 after upgrade (duration: 00m 48s)
[06:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:21] (PS2) Elukey: role::analytics_cluster::coordinator: review Hive's GC settings [puppet] - https://gerrit.wikimedia.org/r/514219 (https://phabricator.wikimedia.org/T222895)
[06:44:56] (CR) Elukey: [C: +2] role::analytics_cluster::coordinator: review Hive's GC settings [puppet] - https://gerrit.wikimedia.org/r/514219 (https://phabricator.wikimedia.org/T222895) (owner: Elukey)
[06:56:31] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:57:08] !log restart hive metastore on an-coord1001 to apply new GC/heap settings
[06:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:55] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:51] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[07:09:44] (PS1) Marostegui: db-codfw.php: Move db2058 to s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/514224 (https://phabricator.wikimedia.org/T221533)
[07:13:29] (PS1) Marostegui: mariadb: Move db2058 to s6 [puppet] - https://gerrit.wikimedia.org/r/514225 (https://phabricator.wikimedia.org/T221533)
[07:14:15] (CR) Marostegui: [C: +2] db-codfw.php: Move db2058 to s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/514224 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[07:15:10] (Merged) jenkins-bot: db-codfw.php: Move db2058 to s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/514224 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[07:16:03] !log mobrovac@deploy1001 Started deploy [restbase/deploy@abcb534]: Use only Proton for PDF rendering - T210651
[07:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:09] T210651: Switch all PDF render traffic to new Proton service - https://phabricator.wikimedia.org/T210651
[07:16:18] (CR) jenkins-bot: db-codfw.php: Move db2058 to s6 [mediawiki-config] - https://gerrit.wikimedia.org/r/514224 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[07:16:22] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Move db2058 from s4 to s6 (duration: 00m 47s)
[07:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:52] (PS1) Ppchelko: Clean up configuration for pdfrender service. [puppet] - https://gerrit.wikimedia.org/r/514226 (https://phabricator.wikimedia.org/T210651)
[07:20:28] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712, Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (jcrespo) For extra context, 503 errors can also happen randomly, the current stats say that 99.999512% of requests are suc...
[07:20:36] (CR) Ppchelko: [C: -1] "Self -1 until we're confident we want to undeploy it." [puppet] - https://gerrit.wikimedia.org/r/514226 (https://phabricator.wikimedia.org/T210651) (owner: Ppchelko)
[07:21:22] (CR) Marostegui: [C: +2] mariadb: Move db2058 to s6 [puppet] - https://gerrit.wikimedia.org/r/514225 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[07:21:31] !log draining ganeti1002 for eventual reboot to MDS-enabled Linux kernel
[07:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:58] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:22:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:57] RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[07:28:55] (CR) Ppchelko: [C: -1] "Puppet compiler: https://puppet-compiler.wmflabs.org/compiler1002/16846/" [puppet] - https://gerrit.wikimedia.org/r/514226 (https://phabricator.wikimedia.org/T210651) (owner: Ppchelko)
[07:31:49] (CR) Vgutierrez: [C: +2] ATS: Provide a unified monitoring define [puppet] - https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[07:31:59] (PS19) Vgutierrez: ATS: Provide a unified monitoring define [puppet] - https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217)
[07:35:19] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@abcb534]: Use only Proton for PDF rendering - T210651 (duration: 19m 16s)
[07:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:25] T210651: Switch all PDF render traffic to new Proton service - https://phabricator.wikimedia.org/T210651
[07:43:37] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:44:02] Operations, Security-Team: apache modsec rules deployment with scap - https://phabricator.wikimedia.org/T224887 (elukey) The above change added a new key to the keyholder on the deployment hosts, and an alarm fired since the new key wasn't armed. I acked the alarms with this task as reference, please fol...
[07:50:08] PROBLEM - Ensure trafficserver_exporter is running on cp4021 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[07:51:04] that's me
[07:51:49] (the cp4021 one)
[07:56:10] (Restored) Ppchelko: [EventBus] Make EventFactory and event destination configurable. [mediawiki-config] - https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822) (owner: Ppchelko)
[07:56:33] (PS1) Alexandros Kosiaris: Import SessionID handling patch to our package [software/otrs] - https://gerrit.wikimedia.org/r/514230 (https://phabricator.wikimedia.org/T210861)
[07:58:16] (PS1) Vgutierrez: ATS: Fix trafficserver-exporter nrpe check [puppet] - https://gerrit.wikimedia.org/r/514231 (https://phabricator.wikimedia.org/T221594)
[07:59:14] (PS1) Marostegui: db-codfw.php: Depool db2046 [mediawiki-config] - https://gerrit.wikimedia.org/r/514232 (https://phabricator.wikimedia.org/T221533)
[07:59:33] (CR) Alexandros Kosiaris: [V: +2 C: +2] Import SessionID handling patch to our package [software/otrs] - https://gerrit.wikimedia.org/r/514230 (https://phabricator.wikimedia.org/T210861) (owner: Alexandros Kosiaris)
[08:00:22] (CR) Marostegui: [C: +2] db-codfw.php: Depool db2046 [mediawiki-config] - https://gerrit.wikimedia.org/r/514232 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[08:01:03] (CR) jerkins-bot: [V: -1] db-codfw.php: Depool db2046 [mediawiki-config] - https://gerrit.wikimedia.org/r/514232 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[08:01:58] uh?
[08:02:12] (CR) Marostegui: [C: +2] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/514232 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[08:02:34] (CR) Urbanecm: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/507931 (https://phabricator.wikimedia.org/T222065) (owner: Ammarpad)
[08:02:39] (CR) Ema: [C: +1] ATS: Fix trafficserver-exporter nrpe check [puppet] - https://gerrit.wikimedia.org/r/514231 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:03:11] (Merged) jenkins-bot: db-codfw.php: Depool db2046 [mediawiki-config] - https://gerrit.wikimedia.org/r/514232 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[08:03:23] (PS7) Urbanecm: Add new namespaces for th.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/491054 (https://phabricator.wikimedia.org/T216322) (owner: Ammarpad)
[08:03:25] (CR) Vgutierrez: [C: +2] ATS: Fix trafficserver-exporter nrpe check [puppet] - https://gerrit.wikimedia.org/r/514231 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:03:26] !log restart hive-server2 on an-coord1001 to pick up new GC/Heap settings
[08:03:29] (CR) jenkins-bot: db-codfw.php: Depool db2046 [mediawiki-config] - https://gerrit.wikimedia.org/r/514232 (https://phabricator.wikimedia.org/T221533) (owner: Marostegui)
[08:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:13] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2046 (duration: 00m 47s)
[08:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:21] !log Stop MySQL on db2046 to clone db2058 - T221533
[08:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:26] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533
[08:06:55] RECOVERY - Ensure trafficserver_exporter is running on cp4021 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:13:10] (CR) Urbanecm: [C: -1] "add HD logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/507931 (https://phabricator.wikimedia.org/T222065) (owner: Ammarpad)
[08:16:11] PROBLEM - Host kubestagetcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:16:30] (PS1) Vgutierrez: ATS: Provide more accurate monitor descriptions [puppet] - https://gerrit.wikimedia.org/r/514234 (https://phabricator.wikimedia.org/T221217)
[08:19:35] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[08:19:36] (PS2) Urbanecm: Add localized project logo for sahwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/507931 (https://phabricator.wikimedia.org/T222065) (owner: Ammarpad)
[08:20:10] (Abandoned) Urbanecm: Enable Partial Blocks on Bengali Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/508122 (https://phabricator.wikimedia.org/T222258) (owner: Ammarpad)
[08:20:27] RECOVERY - Host kubestagetcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[08:20:33] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:20:38] (PS3) Urbanecm: Add localized project logo for sahwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/507931 (https://phabricator.wikimedia.org/T222065) (owner: Ammarpad)
[08:20:51] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:21:53] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:21:58] (PS5) Urbanecm: Set default aliases for Project_talk namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) (owner: Ammarpad)
[08:22:11] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:22:53] (CR) jerkins-bot: [V: -1] Set default aliases for Project_talk namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) (owner: Ammarpad)
[08:22:57] (CR) Ema: [C: +1] ATS: Provide more accurate monitor descriptions [puppet] - https://gerrit.wikimedia.org/r/514234 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[08:23:19] !log draining ganeti1004 for eventual reboot to MDS-enabled Linux kernel
[08:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:23] (CR) Vgutierrez: [C: +2] ATS: Provide more accurate monitor descriptions [puppet] - https://gerrit.wikimedia.org/r/514234 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[08:24:06] (CR) Urbanecm: [C: -1] "10:22:49 ----------------------------------------------------------------------" [mediawiki-config] - https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) (owner: Ammarpad)
[08:28:41] (PS2) Elukey: Remove memcached config for nutcracker in eqiad [puppet] - https://gerrit.wikimedia.org/r/510697 (https://phabricator.wikimedia.org/T214275)
[08:29:03] (PS3) Elukey: Remove memcached config for nutcracker in eqiad [puppet] - https://gerrit.wikimedia.org/r/510697 (https://phabricator.wikimedia.org/T214275)
[08:30:18] (CR) Elukey: [C: +2] Remove memcached config for nutcracker in eqiad [puppet] - https://gerrit.wikimedia.org/r/510697 (https://phabricator.wikimedia.org/T214275) (owner: Elukey)
[08:32:01] !log remove memcached nutcracker config from mw1* hosts (not used). Changes will be picked up when nutcracker is restarted (after reboots, etc.) - T214275
[08:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:10] T214275: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275
[08:35:29] I am running puppet on icinga1001, but some alerts might fire
[08:35:49] mmm actually no since I am not restarting nutcracker, never mind
[08:39:43] elukey: we will leave that to the hands of fate
[08:40:56] jijiki: as a side effect of this change, we now monitor the nutcracker conns to redis, that is good :)
[08:41:36] (CR) Volans: [C: +1] "LGTM" [software/conftool] - https://gerrit.wikimedia.org/r/513694 (owner: CDanis)
[08:43:15] (PS4) Volans: icinga: add metamonitor user and its keyholder [puppet] - https://gerrit.wikimedia.org/r/511722 (https://phabricator.wikimedia.org/T222074)
[08:43:18] PROBLEM - check_trafficserver_config_status on cp4022 is CRITICAL: NRPE: Command check_check_trafficserver_config_status not defined
[08:43:44] sigh, chicken and egg problem I guess :)
[08:44:05] (PS12) Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786)
[08:44:14] jijiki: had to rebase because of conflicts --^
[08:44:41] * volans takes the number to add himself to the queue for puppet-merge
[08:44:54] ahahaha
[08:45:09] we should have a bot that does that :)
[08:45:38] * volans would retry in ~5
[08:45:59] it is busy
[08:47:55] (CR) Alexandros Kosiaris: [C: +1] conftool: use non-default ports for integration test etcd [software/conftool] - https://gerrit.wikimedia.org/r/513694 (owner: CDanis)
[08:49:37] (CR) Alexandros Kosiaris: [C: +2] "Cool. thanks!" [puppet] - https://gerrit.wikimedia.org/r/514029 (owner: Alexandros Kosiaris)
[08:49:47] (PS2) Alexandros Kosiaris: docker-registry: Page on LVS level failures [puppet] - https://gerrit.wikimedia.org/r/514029
[08:49:52] (CR) Alexandros Kosiaris: [V: +2 C: +2] docker-registry: Page on LVS level failures [puppet] - https://gerrit.wikimedia.org/r/514029 (owner: Alexandros Kosiaris)
[08:50:40] Operations, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (elukey)
[08:50:42] (CR) Alexandros Kosiaris: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/514025 (https://phabricator.wikimedia.org/T220401) (owner: Alexandros Kosiaris)
[08:53:30] Anyone know why the post merge builds here were failing? https://integration.wikimedia.org/ci/job/service-pipeline-test-and-publish/199/console logs say "The connection to the server localhost:8080 was refused - did you specify the right host or port?" :)
[08:54:27] PROBLEM - Host etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:54:34] (CR) Elukey: "Had to fix a rebase conflict, nothing new added: https://puppet-compiler.wmflabs.org/compiler1002/16847/" [puppet] - https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: Elukey)
[08:55:46] Operations, Maps: Find a better partitioning scheme for maps - https://phabricator.wikimedia.org/T224967 (Mathew.onipe)
[08:56:00] Operations, Maps: Find a better partitioning scheme for maps - https://phabricator.wikimedia.org/T224967 (Mathew.onipe)
[08:56:25] Operations, Maps: Find a better partitioning scheme for maps - https://phabricator.wikimedia.org/T224967 (Mathew.onipe) p:Triage→High
[08:56:36] etcd1001 is the ganeti reboot, will recover shortly
[08:57:01] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:57:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:45] Operations, Maps: Find a better partitioning scheme for maps - https://phabricator.wikimedia.org/T224967 (Gehel) For context: The maps servers have 2x900GB + 2x1.5TB disks. We are at the moment using RAID10 across those disks, so we're wasting a bunch of space. We could do better by doing RAID1 on the sa...
[09:00:41] RECOVERY - Host etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [09:01:44] !log draining ganeti1005 for eventual reboot to MDS-enabled Linux kernel [09:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:04] (03PS5) 10Volans: icinga: add metamonitor user and its keyholder [puppet] - 10https://gerrit.wikimedia.org/r/511722 (https://phabricator.wikimedia.org/T222074) [09:02:07] * volans retry with the queue [09:02:10] *retries [09:02:16] (03PS2) 10Muehlenhoff: Remove support for trusty/Ubuntu in kernel/sysctl settings [puppet] - 10https://gerrit.wikimedia.org/r/498123 [09:02:34] (03PS3) 10Muehlenhoff: Remove support for Ubuntu/trusty in base packages [puppet] - 10https://gerrit.wikimedia.org/r/498126 [09:02:49] (03PS2) 10Muehlenhoff: Remove support for Ubuntu/trusty in monitoring/metrics base classes [puppet] - 10https://gerrit.wikimedia.org/r/498130 [09:03:09] (03PS2) 10Muehlenhoff: Remove support for Ubuntu in apt/debmonitor base classes [puppet] - 10https://gerrit.wikimedia.org/r/498134 [09:03:22] (03CR) 10Volans: [C: 03+2] icinga: add metamonitor user and its keyholder [puppet] - 10https://gerrit.wikimedia.org/r/511722 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [09:03:31] (03PS2) 10Muehlenhoff: base: Remove support for trusty/Ubuntu in multiple places [puppet] - 10https://gerrit.wikimedia.org/r/500400 [09:04:00] (03PS2) 10Muehlenhoff: Remove trusty-wikimedia from aptrepo config [puppet] - 10https://gerrit.wikimedia.org/r/500411 [09:06:05] \o/ [09:07:03] (03PS1) 10Tulsi Bhagat: Custom namespaces for ku.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514239 [09:12:54] (03PS12) 10Elukey: Script to generate service principals/keytabs (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/470566 (owner: 10Muehlenhoff) [09:13:31] (03CR) 10jerkins-bot: [V: 04-1] Script to generate service principals/keytabs (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/470566 (owner: 
10Muehlenhoff) [09:14:11] PROBLEM - check_trafficserver_config_status on cp4026 is CRITICAL: NRPE: Command check_check_trafficserver_config_status not defined [09:14:32] expected noise [09:15:33] PROBLEM - Ensure traffic_manager is running on cp3047 is CRITICAL: NRPE: Command check_traffic_manager not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:15:33] PROBLEM - Ensure traffic_manager is running on cp3038 is CRITICAL: NRPE: Command check_traffic_manager not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:15:34] PROBLEM - check_trafficserver_config_status on cp3038 is CRITICAL: NRPE: Command check_check_trafficserver_config_status not defined [09:15:34] PROBLEM - Ensure traffic_server is running on cp3038 is CRITICAL: NRPE: Command check_traffic_server not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:15:39] PROBLEM - Ensure trafficserver_exporter is running on cp3047 is CRITICAL: NRPE: Command check_trafficserver_exporter not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:15:50] hmmm icinga is messing with me clearly ¬¬ [09:15:50] PROBLEM - check_trafficserver_config_status on cp3047 is CRITICAL: NRPE: Command check_check_trafficserver_config_status not defined [09:15:57] PROBLEM - Ensure trafficserver_exporter is running on cp4024 is CRITICAL: NRPE: Command check_trafficserver_exporter not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:15:57] PROBLEM - Ensure traffic_server is running on cp4024 is CRITICAL: NRPE: Command check_traffic_server not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:16:17] PROBLEM - Ensure traffic_manager is running on cp4024 is CRITICAL: NRPE: Command check_traffic_manager not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:16:21] PROBLEM - Ensure traffic_server is running on cp3047 is CRITICAL: NRPE: Command check_traffic_server not defined 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:16:21] PROBLEM - Ensure traffic_server is running on cp4026 is CRITICAL: NRPE: Command check_traffic_server not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:16:33] PROBLEM - check_trafficserver_config_status on cp4024 is CRITICAL: NRPE: Command check_check_trafficserver_config_status not defined [09:17:08] PROBLEM - Ensure trafficserver_exporter is running on cp3046 is CRITICAL: NRPE: Command check_trafficserver_exporter not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:17:10] PROBLEM - Ensure traffic_server is running on cp3046 is CRITICAL: NRPE: Command check_traffic_server not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:17:28] cutting the branch in a few minutes, if nobody complains, please don't restart gerrit ;) [09:19:04] vgutierrez: need help? [09:19:18] nope, just bogus alerts due to checks getting renamed, sorry [09:19:39] no ATS instance has been harmed :( [09:19:42] (03PS1) 10Arturo Borrero Gonzalez: prometheus wmcs: store metrics from the secondary pdns server [puppet] - 10https://gerrit.wikimedia.org/r/514240 (https://phabricator.wikimedia.org/T224877) [09:20:01] yeah you cannot rename an NRPE check in one go without noise ;) [09:20:21] (03PS2) 10Tulsi Bhagat: Custom namespaces for ku.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514239 (https://phabricator.wikimedia.org/T224327) [09:20:37] I did it with 2 instances.. 
but clearly I got lucky [09:20:44] (03CR) 10Michael Große: [C: 03+1] Do not load InitialiseSettings-labs.php multiple times [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514038 (https://phabricator.wikimedia.org/T224899) (owner: 10WMDE-leszek) [09:21:07] add new one, wait 30m (or force run puppet), change check config to use new one in icinga, run puppet on icinga, remove the old one [09:21:11] (03CR) 10Tulsi Bhagat: "Requires `namespaceDupes.php --wiki=kuwiktionary --fix` after deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514239 (https://phabricator.wikimedia.org/T224327) (owner: 10Tulsi Bhagat) [09:21:27] PROBLEM - Host etcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [09:21:27] PROBLEM - Host kubestagetcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [09:22:27] ^expected? [09:22:30] yes [09:22:35] ganeti eqiad rolling reboot [09:22:47] and etcd hosts are not set up to be highly available on purpose [09:22:51] yeah, those will recover shortly [09:23:35] (03PS4) 10Effie Mouzeli: profile::memcached::instance: Migrate to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/511963 [09:23:59] jijiki: https://phabricator.wikimedia.org/T224556 for more info [09:24:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1001/16849/" [puppet] - 10https://gerrit.wikimedia.org/r/514240 (https://phabricator.wikimedia.org/T224877) (owner: 10Arturo Borrero Gonzalez) [09:24:51] (03CR) 10Effie Mouzeli: [V: 03+1] "Yeap, I think I'd rather do a separate cleanup patch" [puppet] - 10https://gerrit.wikimedia.org/r/511973 (https://phabricator.wikimedia.org/T208844) (owner: 10Effie Mouzeli) [09:25:07] !log disable puppet on mc* hosts to merge 511963 and 511973 [09:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:33] RECOVERY - Host kubestagetcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [09:25:51] RECOVERY - Host etcd1004 is UP: PING OK - Packet 
loss = 0%, RTA = 0.40 ms [09:25:53] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={GET,LIST} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:26:23] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:27:18] !log draining ganeti1006 for eventual reboot to MDS-enabled Linux kernel [09:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:31] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:28:09] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:29:09] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [09:29:51] (03CR) 10Effie Mouzeli: [C: 03+2] profile::memcached::instance: Migrate to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/511963 (owner: 10Effie Mouzeli) [09:30:13] (03PS5) 10Effie Mouzeli: profile::memcached::instance: Migrate to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/511963 [09:39:06] (03CR) 10Jbond: [C: 03+2] firewall logging: enable logging on ores [puppet] - 10https://gerrit.wikimedia.org/r/511707 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [09:39:17] (03PS2) 10Jbond: firewall logging: enable logging on ores [puppet] - 10https://gerrit.wikimedia.org/r/511707 (https://phabricator.wikimedia.org/T116011) [09:43:20] (03PS2) 10Effie Mouzeli: profile::memcached::instance: Add -R 200 option [puppet] - 10https://gerrit.wikimedia.org/r/511973 (https://phabricator.wikimedia.org/T208844) [09:45:26] (03CR) 10Effie Mouzeli: [C: 03+2] profile::memcached::instance: Add -R 200 option [puppet] - 10https://gerrit.wikimedia.org/r/511973 
(https://phabricator.wikimedia.org/T208844) (owner: 10Effie Mouzeli) [09:54:27] (03PS13) 10Elukey: kerberos: add script to generate service principals/keytabs [puppet] - 10https://gerrit.wikimedia.org/r/470566 (https://phabricator.wikimedia.org/T212257) (owner: 10Muehlenhoff) [09:55:56] (03CR) 10Elukey: "Moritz: I moved the file to py extension, added some docs here and there (including argparse) and fixed some linter issues. If you are ok " [puppet] - 10https://gerrit.wikimedia.org/r/470566 (https://phabricator.wikimedia.org/T212257) (owner: 10Muehlenhoff) [09:56:05] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:57:18] (03CR) 10Elukey: "Better: I'll copy paste it on a node and start from there (manually), so I can refine this script as I go." [puppet] - 10https://gerrit.wikimedia.org/r/470566 (https://phabricator.wikimedia.org/T212257) (owner: 10Muehlenhoff) [09:58:08] (03CR) 10Muehlenhoff: "Sure, go ahead!" [puppet] - 10https://gerrit.wikimedia.org/r/470566 (https://phabricator.wikimedia.org/T212257) (owner: 10Muehlenhoff) [09:58:10] 10Operations, 10LDAP-Access-Requests: Request to be added to the ldap/wmf group - https://phabricator.wikimedia.org/T224744 (10fsero) a:03fsero [10:00:15] (03CR) 10Muehlenhoff: "You can also axe the whole chmodfile logic, this was only needed for the cloud vps project where the files were manually copied around. 
Wh" [puppet] - 10https://gerrit.wikimedia.org/r/470566 (https://phabricator.wikimedia.org/T212257) (owner: 10Muehlenhoff) [10:03:30] (03PS1) 10Effie Mouzeli: hiera::hosts: Remove host configs for memcached [puppet] - 10https://gerrit.wikimedia.org/r/514243 (https://phabricator.wikimedia.org/T208844) [10:03:47] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/498123 (owner: 10Muehlenhoff) [10:05:25] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/498126 (owner: 10Muehlenhoff) [10:06:18] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/498130 (owner: 10Muehlenhoff) [10:08:42] (03CR) 10Elukey: "I think that we can remove the other bit of the config for mc1019 and clean everything up, the perf experiment is basically over (no clea" [puppet] - 10https://gerrit.wikimedia.org/r/514243 (https://phabricator.wikimedia.org/T208844) (owner: 10Effie Mouzeli) [10:09:46] (03CR) 10Jbond: "Missed an Ubuntu block" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498134 (owner: 10Muehlenhoff) [10:10:02] (03PS2) 10Effie Mouzeli: hiera::hosts: Remove host configs for memcached [puppet] - 10https://gerrit.wikimedia.org/r/514243 (https://phabricator.wikimedia.org/T208844) [10:10:25] (03CR) 10Effie Mouzeli: "done" [puppet] - 10https://gerrit.wikimedia.org/r/514243 (https://phabricator.wikimedia.org/T208844) (owner: 10Effie Mouzeli) [10:11:04] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500411 (owner: 10Muehlenhoff) [10:12:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500400 (owner: 10Muehlenhoff) [10:13:04] Amir1: around? 
I need help cutting the branch, and looks like you can help :) T224972 [10:13:04] T224972: make-wmf-branch fails for /mediawiki/extensions/JADE - https://phabricator.wikimedia.org/T224972 [10:13:05] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Wikimedia-Apache-configuration: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10Volans) a:05Volans→03None I'm leaving it back to the current clinic duty (@fsero) at this point given that it ne... [10:13:39] (03CR) 10Elukey: [C: 03+1] hiera::hosts: Remove host configs for memcached [puppet] - 10https://gerrit.wikimedia.org/r/514243 (https://phabricator.wikimedia.org/T208844) (owner: 10Effie Mouzeli) [10:16:48] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) [10:16:51] 10Operations, 10Puppet, 10Patch-For-Review: facter 3: add timeout to custom facts external calls - https://phabricator.wikimedia.org/T223938 (10jbond) 05Open→03Resolved a:03jbond A CR for this has been deployed so the spam should be gone. please reopen if spam persists [10:20:33] 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: https://lists.wikimedia.org/mailman/listinfo/wikimediapl-l has mixed encoding - https://phabricator.wikimedia.org/T111457 (10fsero) 05Open→03Declined This task has been inactive for 3 years, so I'm closing it please reopen if this still needed. 
[10:23:54] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloudservices: also allow prometheus scrapping from prod servers [puppet] - 10https://gerrit.wikimedia.org/r/514246 [10:24:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloudservices: also allow prometheus scrapping from prod servers [puppet] - 10https://gerrit.wikimedia.org/r/514246 (owner: 10Arturo Borrero Gonzalez) [10:36:03] 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: "From" at start of line becomes ">From" in pipermail - https://phabricator.wikimedia.org/T115329 (10fsero) [10:36:50] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, 10serviceops, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10mobrovac) [10:37:22] 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: "From" at start of line becomes ">From" in pipermail - https://phabricator.wikimedia.org/T115329 (10fsero) 05Stalled→03Declined This task has been inactive for 3 years, so I'm closing it please reopen if this still needed. [10:38:44] 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: Mailman: Consider hiding real list administrators email addresses - https://phabricator.wikimedia.org/T150164 (10fsero) 05Stalled→03Declined This task has been inactive for 2 years, so I'm closing it please reopen if this still needed. 
[10:39:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [10:41:00] (03PS1) 10Arturo Borrero Gonzalez: openstack: designate: allow prometheus servers to reach the API endpoint [puppet] - 10https://gerrit.wikimedia.org/r/514254 [10:42:32] !log will start rolling reboots of mw1* servers at 10:50 [10:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:59] 10Operations, 10Wikimedia-Mailing-lists: Upgrade to mailman 2.1.22+ - https://phabricator.wikimedia.org/T166944 (10MarcoAurelio) Related somewhat is T52864 however not much progress (none) is happening there. If we ain't upgrading to v.3, I guess we should at least be upgrading to the last supported v.2 version. [10:43:13] 10Operations, 10Wikimedia-Mailing-lists: wikitech-l is mangling my PGP/MIME emails, causing signature validation to fail - https://phabricator.wikimedia.org/T186311 (10fsero) 05Open→03Declined This task has been inactive for 1 year, so I'm closing it please reopen if this still needed. [10:44:58] !log mw1* restarts will be delayed until 11:15 [10:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:07] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10mobrovac) >>! In T220401#5230212, @akosiaris wrote: > > [...] In fact, some numbers I 've heard (I have no actual pro... 
[10:45:11] 10Operations, 10procurement: codfw: (10) core db systems - https://phabricator.wikimedia.org/T224973 (10Marostegui) [10:45:58] (03PS2) 10Arturo Borrero Gonzalez: openstack: designate: allow prometheus servers to reach the API endpoint [puppet] - 10https://gerrit.wikimedia.org/r/514254 [10:49:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1001/16852/" [puppet] - 10https://gerrit.wikimedia.org/r/514254 (owner: 10Arturo Borrero Gonzalez) [10:53:21] PROBLEM - Check systemd state on cloudservices1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:56:24] (03PS1) 10Arturo Borrero Gonzalez: openstack: designate: service: fix ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/514258 [10:57:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: designate: service: fix ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/514258 (owner: 10Arturo Borrero Gonzalez) [11:00:05] Amir1, Lucas_WMDE, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190604T1100). [11:00:05] Urbanecm and Tulsi: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:17] I can SWAT today! [11:00:25] RECOVERY - Check systemd state on cloudservices1003 is OK: OK - running: The system is fully operational [11:00:31] Tulsi, are you around? 
[11:01:22] Urbanecm: few minutes delayed from prev deploy , fyi [11:01:32] Krinkle, okay, let me know when I can start [11:02:23] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [11:02:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:52] !log draining ganeti1007 for eventual reboot to MDS-enabled Linux kernel [11:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:42] * Krinkle staging on mwdebug1002 [11:06:00] (03PS1) 10Arturo Borrero Gonzalez: openstack: drop hiera key profile::openstack::base::monitoring_host [puppet] - 10https://gerrit.wikimedia.org/r/514261 [11:06:11] (03CR) 10Giuseppe Lavagetto: [C: 04-1] mcrouter: allow to tune server timeout and timeouts until tko (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [11:06:58] o/ [11:07:55] * Urbanecm waves to zeljkof [11:09:03] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.6/includes/: T221577 / 1286d131c01886 (duration: 01m 07s) [11:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:09] T221577: Wikimedia\Rdbms\LBFactory::getEmptyTransactionTicket: LinksUpdate does not have outer scope - https://phabricator.wikimedia.org/T221577 [11:09:27] (03PS2) 10Arturo Borrero Gonzalez: openstack: drop hiera key profile::openstack::base::monitoring_host [puppet] - 10https://gerrit.wikimedia.org/r/514261 [11:10:09] Tulsi, around? [11:12:00] Krinkle, is the deployment host free now? 
[11:12:05] Urbanecm: yes, thanks [11:12:13] thanks, starting with my patches then [11:12:41] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491054 (https://phabricator.wikimedia.org/T216322) (owner: 10Ammarpad) [11:13:42] (03Merged) 10jenkins-bot: Add new namespaces for th.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491054 (https://phabricator.wikimedia.org/T216322) (owner: 10Ammarpad) [11:13:58] (03CR) 10jenkins-bot: Add new namespaces for th.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491054 (https://phabricator.wikimedia.org/T216322) (owner: 10Ammarpad) [11:15:20] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494016 (https://phabricator.wikimedia.org/T217499) (owner: 10Ammarpad) [11:15:52] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:491054|Add new namespaces for th.wiki]] (T216322) (duration: 00m 47s) [11:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:56] T216322: Create a new namespaces on thai wikimedia projects - https://phabricator.wikimedia.org/T216322 [11:16:12] (03PS5) 10Urbanecm: Add editcontentmodel right to the templateeditor group on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494016 (https://phabricator.wikimedia.org/T217499) (owner: 10Ammarpad) [11:17:42] (03CR) 10jenkins-bot: Add editcontentmodel right to the templateeditor group on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494016 (https://phabricator.wikimedia.org/T217499) (owner: 10Ammarpad) [11:19:16] (03PS10) 10Urbanecm: Create new protection levels for dewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495918 (https://phabricator.wikimedia.org/T216885) (owner: 10Ammarpad) [11:19:27] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495918 (https://phabricator.wikimedia.org/T216885) (owner: 10Ammarpad) 
[11:19:33] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:494016|Add editcontentmodel right to the templateeditor group on testwiki]] (T217499) (duration: 00m 47s) [11:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:38] T217499: add editcontentmodel access to the templateeditor group on testwiki - https://phabricator.wikimedia.org/T217499 [11:20:04] (03PS3) 10Arturo Borrero Gonzalez: openstack: drop hiera key profile::openstack::base::monitoring_host [puppet] - 10https://gerrit.wikimedia.org/r/514261 [11:20:27] (03Merged) 10jenkins-bot: Create new protection levels for dewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495918 (https://phabricator.wikimedia.org/T216885) (owner: 10Ammarpad) [11:20:43] (03CR) 10jenkins-bot: Create new protection levels for dewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495918 (https://phabricator.wikimedia.org/T216885) (owner: 10Ammarpad) [11:20:57] (03CR) 10jerkins-bot: [V: 04-1] openstack: drop hiera key profile::openstack::base::monitoring_host [puppet] - 10https://gerrit.wikimedia.org/r/514261 (owner: 10Arturo Borrero Gonzalez) [11:22:33] (03PS3) 10Effie Mouzeli: hiera::hosts: Remove host configs for memcached [puppet] - 10https://gerrit.wikimedia.org/r/514243 (https://phabricator.wikimedia.org/T208844) [11:22:36] (03PS17) 10Urbanecm: Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: 10Ammarpad) [11:22:43] (03PS4) 10Arturo Borrero Gonzalez: openstack: drop hiera key profile::openstack::base::monitoring_host [puppet] - 10https://gerrit.wikimedia.org/r/514261 [11:22:45] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: 10Ammarpad) [11:23:07] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 
[[:gerrit:495918|Create new protection levels for dewiktionary]] (1/2, T216885) (duration: 00m 47s) [11:23:09] (03PS3) 10Alexandros Kosiaris: Introduce sessionstore LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/514025 (https://phabricator.wikimedia.org/T220401) [11:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:13] T216885: New protection levels on german wiktionary - https://phabricator.wikimedia.org/T216885 [11:23:20] (03PS4) 10Alexandros Kosiaris: Introduce sessionstore LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/514025 (https://phabricator.wikimedia.org/T220401) [11:23:34] (03CR) 10jerkins-bot: [V: 04-1] openstack: drop hiera key profile::openstack::base::monitoring_host [puppet] - 10https://gerrit.wikimedia.org/r/514261 (owner: 10Arturo Borrero Gonzalez) [11:23:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC seems happy at PS2, PS3 is just a minor change, merging" [puppet] - 10https://gerrit.wikimedia.org/r/514025 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [11:23:51] (03Merged) 10jenkins-bot: Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: 10Ammarpad) [11:24:09] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: [[:gerrit:495918|Create new protection levels for dewiktionary]] (2/2, T216885) (duration: 00m 47s) [11:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:42] Urbanecm: Looks like I synced the wrong directory just now, (wmf.6 instead of wmf.7) so wmf.7 is currently unsafe to deploy from, just FYI - after you're done I'll re-sync. [11:25:03] Krinkle, I'm not going to touch anything in wmf.6 nor wmf.7, but thanks [11:25:34] Krinkle: is it because my comment? 
[11:26:15] jynus: No, it's because after I read your comment, I saw the log line says I synced wmf.6/includes, but I merged to wmf.7. [11:26:21] So my deploy did nothing [11:26:24] (03CR) 10jenkins-bot: Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: 10Ammarpad) [11:26:46] yeah, I mean as in, my comment was related to that? [11:27:02] or is it still intended? [11:27:04] jynus: Yes, indeed. [11:27:07] ok [11:27:08] thanks [11:27:09] (03CR) 10Effie Mouzeli: [C: 03+2] hiera::hosts: Remove host configs for memcached [puppet] - 10https://gerrit.wikimedia.org/r/514243 (https://phabricator.wikimedia.org/T208844) (owner: 10Effie Mouzeli) [11:27:11] It was intended to be fixed 10 minutes ago [11:27:16] cool, np [11:27:17] but I synced the wrong directory [11:27:22] it happens [11:27:23] the fix is now staged but not yet deployed [11:27:24] :-D [11:27:28] yeah :D [11:27:46] just was around poking with a stick in case :-P [11:27:55] to award you for fixing it! 
[11:27:58] :-D [11:28:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:486221|Add Author namespace in Sanskrit Wikisource]] (T214553) (duration: 00m 46s) [11:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:11] T214553: Create AUTHOR namespace in Sanskrit Wikisource - https://phabricator.wikimedia.org/T214553 [11:28:51] !log run mwscript namespaceDupes.php --wiki=thwiki --fix (T216322) [11:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:57] T216322: Create a new namespaces on thai wikimedia projects - https://phabricator.wikimedia.org/T216322 [11:28:59] (03PS5) 10Alexandros Kosiaris: Introduce sessionstore LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/514025 (https://phabricator.wikimedia.org/T220401) [11:29:07] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Introduce sessionstore LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/514025 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [11:29:47] !log running mwscript namespaceDupes.php --wiki=sawikisource --add-prefix=T214553 --fix (T214553) [11:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:50] (03PS4) 10Urbanecm: Add localized project logo for sahwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507931 (https://phabricator.wikimedia.org/T222065) (owner: 10Ammarpad) [11:30:59] 10Operations, 10puppet-compiler: puppet-catalog-compiler: compilation result randomly places servers in the 'failed' section - https://phabricator.wikimedia.org/T224977 (10aborrero) [11:31:00] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507931 (https://phabricator.wikimedia.org/T222065) (owner: 10Ammarpad) [11:31:15] !log enabling puppet on mc2* [11:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:59] (03Merged) 10jenkins-bot: Add 
localized project logo for sahwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507931 (https://phabricator.wikimedia.org/T222065) (owner: 10Ammarpad) [11:32:14] (03CR) 10jenkins-bot: Add localized project logo for sahwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507931 (https://phabricator.wikimedia.org/T222065) (owner: 10Ammarpad) [11:33:37] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514239 (https://phabricator.wikimedia.org/T224327) (owner: 10Tulsi Bhagat) [11:33:59] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: [[:gerrit:507931|Add localized project logo for sahwikiquote]] (1/2, T222065) (duration: 00m 47s) [11:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:05] T222065: Updating Sakha Wikiquote logo - https://phabricator.wikimedia.org/T222065 [11:34:31] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 65 connections established with conf1004.eqiad.wmnet:4001 (min=66) https://wikitech.wikimedia.org/wiki/PyBal [11:35:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:507931|Add localized project logo for sahwikiquote]] (2/2, T222065) (duration: 00m 47s) [11:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:23] (03PS3) 10Urbanecm: Custom namespaces for ku.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514239 (https://phabricator.wikimedia.org/T224327) (owner: 10Tulsi Bhagat) [11:35:39] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "jenkins already succeeded, overriding to save time, forgot to rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514239 (https://phabricator.wikimedia.org/T224327) (owner: 10Tulsi Bhagat) [11:35:42] ^ expected [11:35:55] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.29:8081]) 
https://wikitech.wikimedia.org/wiki/PyBal [11:36:03] (03PS5) 10Arturo Borrero Gonzalez: openstack: drop hiera key profile::openstack::base::monitoring_host [puppet] - 10https://gerrit.wikimedia.org/r/514261 [11:36:03] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.29:8081]) https://wikitech.wikimedia.org/wiki/PyBal [11:36:10] also expected ^ [11:36:28] jijiki: "PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 65 connections established with conf1004.eqiad.wmnet:4001 (min=66) " [11:36:31] that one also? [11:36:42] sounds like it might be due to etcd pulls from MW but maybe not? [11:36:46] yes [11:36:59] ok :) [11:37:01] (03CR) 10jerkins-bot: [V: 04-1] openstack: drop hiera key profile::openstack::base::monitoring_host [puppet] - 10https://gerrit.wikimedia.org/r/514261 (owner: 10Arturo Borrero Gonzalez) [11:37:10] it is due to https://gerrit.wikimedia.org/r/514025 [11:37:26] cool [11:37:31] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:514239|Custom namespaces for ku.wiktionary]] (T224327) (duration: 00m 46s) [11:37:34] (03CR) 10jenkins-bot: Custom namespaces for ku.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514239 (https://phabricator.wikimedia.org/T224327) (owner: 10Tulsi Bhagat) [11:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:46] T224327: Custom namespaces for ku.wiktionary - https://phabricator.wikimedia.org/T224327 [11:38:37] !log run mwscript namespaceDupes.php --wiki=kuwiktionary --fix (T224327) [11:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:50] Krinkle, I'm done, feel free to re-sync [11:39:15] !log enabling puppet on mc1* [11:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:57] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 66 connections established with 
conf1004.eqiad.wmnet:4001 (min=66) https://wikitech.wikimedia.org/wiki/PyBal [11:39:58] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.7/includes/: T221577 / 1286d131c01886 (duration: 01m 04s) [11:40:07] !log restart pybal on lvs1015 for sessionstore LVS configuration. T220401 [11:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:31] T221577: Wikimedia\Rdbms\LBFactory::getEmptyTransactionTicket: LinksUpdate does not have outer scope - https://phabricator.wikimedia.org/T221577 [11:40:41] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 39 connections established with conf2001.codfw.wmnet:2379 (min=41) https://wikitech.wikimedia.org/wiki/PyBal [11:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:56] T220401: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 [11:41:19] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:41:21] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:41:29] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:41:36] (03CR) 10CDanis: [C: 03+2] conftool: use non-default ports for integration test etcd [software/conftool] - 10https://gerrit.wikimedia.org/r/513694 (owner: 10CDanis) [11:42:35] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:43:41] PROBLEM - HTTP availability for 
Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:43:43] that's ^ ok [11:43:51] the lvs1015 pybal thing I mean [11:43:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:44:01] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:44:12] (03Merged) 10jenkins-bot: conftool: use non-default ports for integration test etcd [software/conftool] - 10https://gerrit.wikimedia.org/r/513694 (owner: 10CDanis) [11:44:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:44:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:44:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:44:33] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes2001.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:44:33] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:44:47] (03PS6) 10Arturo Borrero Gonzalez: openstack: drop hiera key profile::openstack::base::monitoring_host [puppet] - 10https://gerrit.wikimedia.org/r/514261 [11:45:24] (03CR) 10jerkins-bot: [V: 04-1] openstack: drop hiera key profile::openstack::base::monitoring_host [puppet] - 10https://gerrit.wikimedia.org/r/514261 (owner: 10Arturo Borrero Gonzalez) [11:45:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:46:21] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:46:31] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [11:46:39] o/ Urbanecm [11:47:00] Woah.. You can deploy. \o/ [11:47:13] Thank you so much! 
:) [11:47:29] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [11:47:51] seems to be cp1077 related [11:47:55] (03PS7) 10Arturo Borrero Gonzalez: openstack: drop hiera key profile::openstack::base::monitoring_host [puppet] - 10https://gerrit.wikimedia.org/r/514261 [11:47:55] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:48:00] Cc: vgutierrez, ema --^ [11:48:11] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:48:13] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [11:48:15] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:48:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:48:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:48:33] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:48:42] hmmm cp1077 varnish-be was restarted yesterday EU morning IIRC by ema [11:48:47] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:48:47] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:49:24] vgutierrez: not sure about cp1077, i noticed the backend listed in the 50x dashboard, might e wrong :) [11:50:55] (03PS1) 10Alexandros Kosiaris: lvs: sessionstore ProxyFetch over HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/514265 [11:51:18] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] lvs: sessionstore ProxyFetch over HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/514265 (owner: 10Alexandros Kosiaris) [11:51:29] RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 41 connections established with conf2001.codfw.wmnet:2379 (min=41) https://wikitech.wikimedia.org/wiki/PyBal [11:51:30] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Patch-For-Review, and 2 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10jijiki) [11:51:39] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Patch-For-Review, and 2 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10jijiki) 05Open→03Resolved [11:51:42] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10jijiki) [11:51:49] FYI I just added 'top 20 backend' to the 5xx dashboard [11:51:52] * akosiaris is to blame for 
icinga1001 puppet failing, will fix [11:52:41] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10fgiunchedi) I've added a 'top 20 backend' panel to https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X ! th... [11:53:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/16858/" [puppet] - 10https://gerrit.wikimedia.org/r/514261 (owner: 10Arturo Borrero Gonzalez) [11:53:07] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [11:53:12] (03PS8) 10Arturo Borrero Gonzalez: openstack: drop hiera key profile::openstack::base::monitoring_host [puppet] - 10https://gerrit.wikimedia.org/r/514261 [11:53:23] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:53:35] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [11:53:46] there was a spike of upload errors on esams [11:53:48] (03PS1) 10Alexandros Kosiaris: Fix sessionstore LVS monitoring duplicate stanzas [puppet] - 10https://gerrit.wikimedia.org/r/514266 [11:54:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:54:23] actually text, too: https://grafana.wikimedia.org/d/000000464/prometheus-varnish-aggregate-client-status-code?orgId=1&var-site=codfw&var-site=eqiad&var-site=eqsin&var-site=esams&var-site=ulsfo&var-cache_type=varnish-text&var-cache_type=varnish-upload&var-status_type=5&from=1559648216377&to=1559649090531 [11:56:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] Fix sessionstore LVS monitoring duplicate stanzas [puppet] - 10https://gerrit.wikimedia.org/r/514266 (owner: 10Alexandros Kosiaris) [11:56:27] (03PS2) 10Alexandros Kosiaris: Fix sessionstore LVS monitoring duplicate stanzas [puppet] - 10https://gerrit.wikimedia.org/r/514266 [11:56:36] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Fix sessionstore LVS monitoring duplicate stanzas [puppet] - 10https://gerrit.wikimedia.org/r/514266 (owner: 10Alexandros Kosiaris) [11:59:42] (03CR) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [11:59:49] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190604T1200) [12:00:33] elukey: I don't think it's strictly related to cp1077 cause the most common uri_host affected is upload.wm.o and cp1077 is a text cache server [12:01:22] vgutierrez, elukey: what's up? 
[12:01:36] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Patch-For-Review, and 2 others: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10elukey) This task will be completed after the next round of reboots for the mc* hosts (as FYI fo... [12:01:49] bunch of 5xx (alerting) at 11:40 - 11:42 UTC [12:02:19] (03PS1) 10Arturo Borrero Gonzalez: prometheus: fix missing parentheses in openstack/rabbitmq exporter [puppet] - 10https://gerrit.wikimedia.org/r/514269 [12:02:24] vgutierrez: was it text, upload, or both? [12:02:34] both [12:02:56] mainly en.w.o [12:03:01] PROBLEM - puppet last run on cloudcontrol1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[ferm] [12:03:04] but commons was affected as well [12:03:25] there was a spike on fatals as well [12:03:33] hmmm commons.wm.o is text [12:03:40] so probably only text was affected [12:03:57] https://logstash.wikimedia.org/goto/8c827770af7e81358949be3e2b40f530 [12:03:58] vgutierrez: I do see also a (much smaller) spike for upload: https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&panelId=2&fullscreen&from=now-3h&to=now&var-site=All&var-cache_type=upload&var-status_type=5 [12:04:03] PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 40 connections established with conf2001.codfw.wmnet:2379 (min=41) https://wikitech.wikimedia.org/wiki/PyBal [12:04:03] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:04:11] PROBLEM - puppet last run on cloudcontrol1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[ferm] [12:04:40] !log restart pybal on lvs2006 for sessionstore LVS configuration. 
T220401 [12:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:46] T220401: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 [12:04:54] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514270 [12:05:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] prometheus: fix missing parentheses in openstack/rabbitmq exporter [puppet] - 10https://gerrit.wikimedia.org/r/514269 (owner: 10Arturo Borrero Gonzalez) [12:06:15] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.29:8081]) https://wikitech.wikimedia.org/wiki/PyBal [12:06:27] (03PS1) 10Reedy: Update casing of Jade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514271 [12:07:28] (03CR) 10Marostegui: [C: 03+2] Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514270 (owner: 10Marostegui) [12:08:03] RECOVERY - puppet last run on cloudcontrol1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [12:08:14] zeljkof: Want to carry on with the train, and I'll clear up the Jade mess after? [12:08:19] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514270 (owner: 10Marostegui) [12:08:31] elukey: above you said that the issue seemd cp1077 related. Why? 
[12:08:37] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514270 (owner: 10Marostegui) [12:08:58] elukey: I see that fetches failed from all eqiad text backends: https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-1h&to=now [12:09:13] RECOVERY - puppet last run on cloudcontrol1004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:09:20] ema: number of occurrences of cp1077 in the varnish 5xx kibana dashboard [12:09:29] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2046 (duration: 00m 46s) [12:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:35] ema: yes you are right, but the 50x dashboard indicated an int in cp1077 in the top errors... [12:10:48] elukey: you're right too! [12:11:14] anyway, as I wrote before, I just wanted to ping you guys, that's it :) [12:11:18] (03CR) 10Vgutierrez: [C: 03+1] cache_upload: return HTTP 403 to requests violating UA policy [puppet] - 10https://gerrit.wikimedia.org/r/514017 (https://phabricator.wikimedia.org/T224891) (owner: 10Ema) [12:11:56] elukey: sure, thanks, ofc you want to help digging you make us happy :) [12:12:56] so, it seems to me that the issue was sporadic, but still it would be nice to find out more. Can someone please take care of that? I need to eat [12:13:15] !log restart pybal on lvs2003, lvs1015 for sessionstore LVS configuration. 
T220401 [12:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:20] T220401: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 [12:14:15] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:14:42] (03PS8) 10Vgutierrez: ATS: Provide a unified logs define [puppet] - 10https://gerrit.wikimedia.org/r/510641 (https://phabricator.wikimedia.org/T221217) [12:16:33] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:19:17] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [12:21:56] jouncebot, now [12:21:57] For the next 0 hour(s) and 38 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190604T1200) [12:22:04] !log ran mwscript deleteBatch.php --wiki=sawikisource -r '[[:phab:T214553|T214553]]: deleting useless red [12:22:04] irects' [12:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:11] T214553: Create AUTHOR namespace in Sanskrit Wikisource - https://phabricator.wikimedia.org/T214553 [12:24:42] as marostegui pointed out, at 11:40 UTC we have a spike of fatals screaming about mw timeouts (>60 secs) [12:27:01] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:29:25] vgutierrez: one nice follow up that we could do is to figure out how to properly read the new Top 20 backend panel that go*dog added to the 50x dashboard. Is the predominance of the "Varnish" backend normal since only the eqiad backends were suffering? 
(so all the others, fetching from them, reported as 'backend' 'Varnish') [12:29:56] trying to figure out how to read the new info that we have in logstash [12:30:20] if mw hosts were the problem, it should become easier to pin point the backend host and check in the httpd logs etc.. [12:32:02] elukey: so then we should figure out from where it comes that "Varnish" Referer [12:32:19] cause of course right now every text request that hits the mw servers goes via a varnish server in eqiad [12:32:23] (or codfw) [12:33:51] PROBLEM - LVS HTTP IPv4 on sessionstore.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:34:24] vgutierrez: yeah I agree.. I suspect that if a varnish backend fetches from another one, then the Backend respose header is "Varnish" [12:34:29] (pure speculation) [12:34:59] meanwhile, if a varnish backend fetches from mw (like the eqiad ones) then a proper hostname appears [12:38:04] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) @herron fixed [12:38:26] Reedy: sorry, I was out to lunch [12:38:36] All good :) [12:38:43] I'm not sure if I can continue [12:38:49] You can [12:38:56] branch cut stopped with jade [12:39:05] the rest of the extensions are cut? [12:39:07] 10Operations, 10Acme-chief, 10Traffic, 10HTTPS: acme-chief: Validate that configured certificates can be actually issued - https://phabricator.wikimedia.org/T220518 (10Vgutierrez) 05Open→03Resolved [12:39:10] 10Operations, 10Traffic, 10Goal, 10HTTPS, 10Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (10Vgutierrez) [12:39:21] It's ages since I ran the make branch script... 
[12:39:31] PROBLEM - LVS HTTP IPv4 on sessionstore.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:39:35] * apergos peeks in looking hopeful [12:39:36] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) a:05Papaul→03herron [12:39:38] Try re-running it [12:39:56] It might just ignore it can't push the first lot, and carry on [12:39:57] Reedy: looks like Amir1's patch could fix the branch cut https://gerrit.wikimedia.org/r/c/mediawiki/tools/release/+/508643 [12:40:32] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10Papaul) [12:40:33] but I'm reluctant to merge it, I don't really know how the script works, or if this could make things worse :/ [12:40:40] I've already merged it [12:40:46] Just git pull to get the change [12:40:47] * setup an alreadyBranched array that has the names of all extensions [12:40:47] * up-to the extension from which we would like to start branching [12:40:56] ah, thanks, didn't notice it [12:41:09] ok, will pull and run the script again [12:41:22] do I need to delete wmf.8 folder? 
[12:41:28] (at mwdeploy) [12:41:46] (03PS1) 10Jcrespo: mariadb: Depool labsdb1011 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/514273 (https://phabricator.wikimedia.org/T221577) [12:42:37] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool labsdb1011 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/514273 (https://phabricator.wikimedia.org/T221577) (owner: 10Jcrespo) [12:44:08] zeljkof: It looks like the script will nuke the build directory if it already exists [12:44:22] So shouldn't matter either way [12:44:52] ok, I'll start in a few minutes, just to finish something else [12:49:20] elukey: yeah, indeed, this is mostly useful for 500s, not cache-layer 503s [12:52:17] PROBLEM - LVS HTTP IPv4 on sessionstore.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:53:59] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:54:58] <_joe_> akosiaris: that you ^^? [12:55:01] * akosiaris marks as downtimed the sessionstore pybal/lvs stuff [12:55:08] <_joe_> ok :D [12:56:29] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: services: unify primary/secondary roles [puppet] - 10https://gerrit.wikimedia.org/r/514279 (https://phabricator.wikimedia.org/T224743) [12:57:33] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:57:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190604T1300). 
[13:00:18] thank you jouncebot for the reminder [13:00:27] <_joe_> zeljkof: ping me when you start scap sync [13:00:38] everybody else, please don't break anything while I'm cutting the branch ;) [13:00:49] PROBLEM - Host etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:52] _joe_: will do, but it will be a while, I'm still cutting the branch [13:00:59] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:01:03] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:01:05] <_joe_> akosiaris: ^^ [13:01:13] <_joe_> two etcd hosts donw? [13:01:17] <_joe_> oh different cluster [13:01:27] and the second one unused [13:01:32] <_joe_> I was about do faint :D [13:01:40] <_joe_> *to [13:01:52] yeah, I 've already vetted that and I 've taken precautions against it in ganeti already [13:02:20] that being said, we should reimage an etcd cluster with stretch and etcd3 [13:02:25] yeah, ganeti reboots, will recover shortly [13:02:26] but alas.. no time yet [13:05:27] RECOVERY - Host etcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [13:05:31] RECOVERY - Host etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [13:06:29] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={PATCH,POST} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:06:29] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={create,get} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:07:49] Reedy: ok, so just re-running the script doesn't work, I'm getting `To ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/AdvancedSearch ! [rejected] wmf/1.34.0-wmf.8 -> wmf/1.34.0-wmf.8 (fetch first)` [13:08:09] so I geuss I have to delete `/tmp/make-wmf-branch` first? [13:08:22] hashar: will that work? :) ^ [13:08:26] No? 
[13:08:36] Because it's gerrit that's rejecting it [13:08:52] https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/MakeWmfBranch.php#L45 [13:08:54] ah, so it's trying to cut at gerrit, right [13:09:18] so, I have to tell it to start with Jade, right? [13:09:31] I think so, it's just not clear how/where to call that function [13:10:00] oh [13:10:00] $obj->setStartExtension( $firstExtension ); [13:10:10] array( 'c', 'continue-from', 'firstExtension', false ), [13:10:18] looks like you just need to pass... [13:10:23] -c Jade [13:10:27] or -c=Jade [13:10:28] not sure which [13:10:45] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:10:45] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:11:38] thanks, looking... [13:12:08] !log draining ganeti1008 for eventual reboot to MDS-enabled Linux kernel [13:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:24] Reedy: found it https://github.com/wikimedia/mediawiki-tools-release/commit/48f23be0795a56ac9cd7e119580c6549649db421#diff-fbc8fe070b5dfc729e4961494050e0f1 [13:13:58] Yeah, hence me saying you need to pass -c Jade ;) [13:14:26] I just wasn't sure if = is needed or not :) [13:15:19] oh, well, the rest of the arguments don't have = [13:15:27] anyway, cutting the branch [13:15:52] zfilipin@deploy1001:~/release/make-wmf-branch$ ./make-wmf-branch -n 1.34.0-wmf.8 -o master -c Jade [13:16:02] [ERROR] Could not find extension 'Jade' in any branched Extension list [13:16:16] ah, it's probably mw/ext/jade, let me check... [13:16:18] Did you git pull? [13:16:22] Or that [13:16:38] I did, but it's probably not just Jade, but mw/ext/Jade [13:17:51] it's `extensions/Jade` [13:19:03] hm, this still errors out with AdvancedSearch? 
[13:19:08] `zfilipin@deploy1001:~/release/make-wmf-branch$ ./make-wmf-branch -n 1.34.0-wmf.8 -o master -c extensions/Jade` [13:19:35] maybe the order of the arguments is important? [13:21:21] -c extensions/Jade iirc [13:21:35] thcipriani: Your docs are wrong then ;P https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/make-wmf-branch#L22-L26 [13:22:21] yeah, I something changed here recently, haven't had time to look into what exactly :/ [13:23:08] (03PS1) 10Volans: gitignore: add __pycache__ [software/conftool] - 10https://gerrit.wikimedia.org/r/514287 [13:23:10] (03PS1) 10Volans: Rename variable to avoid shadowing built-in [software/conftool] - 10https://gerrit.wikimedia.org/r/514288 [13:23:12] (03PS1) 10Volans: Uniform quotes [software/conftool] - 10https://gerrit.wikimedia.org/r/514289 [13:23:15] (03PS1) 10Volans: tox: small cleanup, add .eggs to flake8 ignores [software/conftool] - 10https://gerrit.wikimedia.org/r/514290 [13:23:16] (03PS1) 10Volans: travis: add missing dependency [software/conftool] - 10https://gerrit.wikimedia.org/r/514291 [13:23:19] (03PS1) 10Volans: style: manually fix unchecked style violations [software/conftool] - 10https://gerrit.wikimedia.org/r/514292 [13:23:21] (03PS1) 10Volans: mock: use unittest.mock and remove 3rd party mock [software/conftool] - 10https://gerrit.wikimedia.org/r/514293 [13:24:07] zeljkof: what is the advancedsearch error? [13:24:49] it's still trying to push for that extension [13:24:58] `To ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/AdvancedSearch ! 
[rejected] wmf/1.34.0-wmf.8 -> wmf/1.34.0-wmf.8 (fetch first)` [13:26:40] (03CR) 10CDanis: [C: 03+1] gitignore: add __pycache__ [software/conftool] - 10https://gerrit.wikimedia.org/r/514287 (owner: 10Volans) [13:26:57] (03CR) 10CDanis: [C: 03+1] Rename variable to avoid shadowing built-in [software/conftool] - 10https://gerrit.wikimedia.org/r/514288 (owner: 10Volans) [13:27:15] (03CR) 10CDanis: [C: 03+1] Uniform quotes [software/conftool] - 10https://gerrit.wikimedia.org/r/514289 (owner: 10Volans) [13:27:17] (03CR) 10Volans: [C: 03+2] Rename variable to avoid shadowing built-in [software/conftool] - 10https://gerrit.wikimedia.org/r/514288 (owner: 10Volans) [13:27:35] (03CR) 10Volans: [C: 03+2] gitignore: add __pycache__ [software/conftool] - 10https://gerrit.wikimedia.org/r/514287 (owner: 10Volans) [13:27:39] (03CR) 10CDanis: [C: 03+1] tox: small cleanup, add .eggs to flake8 ignores [software/conftool] - 10https://gerrit.wikimedia.org/r/514290 (owner: 10Volans) [13:27:52] (03CR) 10CDanis: [C: 03+1] travis: add missing dependency [software/conftool] - 10https://gerrit.wikimedia.org/r/514291 (owner: 10Volans) [13:27:57] oh, `-o master` is not needed? 
:) https://github.com/wikimedia/mediawiki-tools-release/commit/eed75d9eccb49fae5b916d874472171d15e0f20a#diff-fbc8fe070b5dfc729e4961494050e0f1 [13:29:32] (03CR) 10Hashar: [C: 03+2] Uniform quotes [software/conftool] - 10https://gerrit.wikimedia.org/r/514289 (owner: 10Volans) [13:29:42] shouldn't break it though [13:29:51] (03CR) 10CDanis: [C: 03+1] style: manually fix unchecked style violations [software/conftool] - 10https://gerrit.wikimedia.org/r/514292 (owner: 10Volans) [13:30:03] (03Merged) 10jenkins-bot: gitignore: add __pycache__ [software/conftool] - 10https://gerrit.wikimedia.org/r/514287 (owner: 10Volans) [13:30:05] yeah, just something I've noticed while looking at git history [13:30:08] (03Merged) 10jenkins-bot: Rename variable to avoid shadowing built-in [software/conftool] - 10https://gerrit.wikimedia.org/r/514288 (owner: 10Volans) [13:30:10] 10Operations, 10Analytics: Reduce memory allocation for kafkamon instances - https://phabricator.wikimedia.org/T224988 (10MoritzMuehlenhoff) [13:30:14] ok, at a keyboard now. [13:30:15] 10Operations, 10Analytics: Reduce memory allocation for kafkamon instances - https://phabricator.wikimedia.org/T224988 (10MoritzMuehlenhoff) p:05Triage→03Normal [13:30:32] (03CR) 10Volans: [C: 03+2] tox: small cleanup, add .eggs to flake8 ignores [software/conftool] - 10https://gerrit.wikimedia.org/r/514290 (owner: 10Volans) [13:30:38] (03CR) 10Volans: [C: 03+2] travis: add missing dependency [software/conftool] - 10https://gerrit.wikimedia.org/r/514291 (owner: 10Volans) [13:30:56] !log starting rolling reboots of mw1* [13:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:20] (03CR) 10Hashar: [C: 03+2] tox: small cleanup, add .eggs to flake8 ignores [software/conftool] - 10https://gerrit.wikimedia.org/r/514290 (owner: 10Volans) [13:31:36] jbond42: Um, is now a good time to be doing reboots of mw hosts when it's supposed to be deploy train time? 
[13:32:05] (03Merged) 10jenkins-bot: Uniform quotes [software/conftool] - 10https://gerrit.wikimedia.org/r/514289 (owner: 10Volans) [13:32:12] oh sorry i hadn't noticed a deploy i can wait [13:32:22] (03CR) 10Hashar: "Why was CI / flake8 not complaining about those issues?" [software/conftool] - 10https://gerrit.wikimedia.org/r/514292 (owner: 10Volans) [13:33:00] (03Merged) 10jenkins-bot: tox: small cleanup, add .eggs to flake8 ignores [software/conftool] - 10https://gerrit.wikimedia.org/r/514290 (owner: 10Volans) [13:33:06] (03Merged) 10jenkins-bot: travis: add missing dependency [software/conftool] - 10https://gerrit.wikimedia.org/r/514291 (owner: 10Volans) [13:33:54] (03CR) 10Volans: [C: 03+2] "> Patch Set 1:" [software/conftool] - 10https://gerrit.wikimedia.org/r/514292 (owner: 10Volans) [13:34:22] (03PS2) 10Fsero: Decommision darmstadtium [puppet] - 10https://gerrit.wikimedia.org/r/514020 (https://phabricator.wikimedia.org/T224562) [13:35:04] Reedy, jbond42: thanks, we are a bit behind on train, still cutting the branch [13:35:33] zeljkof: no problem [13:36:28] (03Merged) 10jenkins-bot: style: manually fix unchecked style violations [software/conftool] - 10https://gerrit.wikimedia.org/r/514292 (owner: 10Volans) [13:36:40] (03CR) 10Muehlenhoff: "That seems fine, but still needs discussion/approval in the next SRE meeting, next week is the SRE offsite, so the week after that." 
[puppet] - 10https://gerrit.wikimedia.org/r/510753 (https://phabricator.wikimedia.org/T223463) (owner: 10Rush) [13:36:44] jbond42: for future reference, just take a quick look at this page if there's ongoing deployment window :) https://github.com/wikimedia/mediawiki-tools-release/commit/eed75d9eccb49fae5b916d874472171d15e0f20a#diff-fbc8fe070b5dfc729e4961494050e0f1 [13:36:52] ah, wrong link, sorry [13:37:01] https://wikitech.wikimedia.org/wiki/Deployments [13:37:08] thanks [13:37:43] jouncebot: now [13:37:43] For the next 1 hour(s) and 22 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190604T1300) [13:37:48] ^ or that [13:38:35] Lucas_WMDE: ah, good point, forgot about that, that might be easier [13:39:13] (03CR) 10Elukey: "Cole, thanks a ton for this work! I added a comment that might already be handled by the code of the exporter, that I wasn't able to fully" (031 comment) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [13:39:23] thcipriani: thanks for staring at screens before coffee and breakfast [13:39:34] probably makes me less effective :) [13:40:09] although I think I see the problem... [13:40:39] (03CR) 10Fsero: [C: 03+2] Decommision darmstadtium [puppet] - 10https://gerrit.wikimedia.org/r/514020 (https://phabricator.wikimedia.org/T224562) (owner: 10Fsero) [13:42:30] other then php sucks? [13:43:56] somehow I thought all deployment tooling was in python [13:44:22] I guess that shows I didn't really look beyond scap [13:44:30] <_joe_> you can write python to look like php, anyways [13:44:57] :D [13:45:05] I think: https://gerrit.wikimedia.org/r/514296/ ought to do it [13:45:25] How long has that been broken? 
[13:45:29] still not sure what changed: maybe this pre-dates the change to config.json [13:45:47] branch cut doesn't fail often, so that's possible [13:46:07] thcipriani: should I merge it, pull at deploy1001 and try again? [13:46:51] !log fsero@cumin1001 START - Cookbook sre.hosts.decommission [13:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:56] zeljkof: yes please [13:46:57] !log fsero@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [13:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:02] 10Operations, 10Kubernetes, 10Patch-For-Review: Decommission darmstadtium - https://phabricator.wikimedia.org/T224562 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fsero@cumin1001 for hosts: `darmstadtium.eqiad.wmnet` - darmstadtium.eqiad.wmnet - Removed from Puppet master and Puppet... [13:48:01] thcipriani: will do, and thanks [13:48:14] thcipriani: What did you do when Jade broke stuff a few weeks ago? :P [13:48:19] you can continue with waking up now, I'll scream if needed ;) [13:49:31] Reedy: I re-ran with -c extensions/Jade IIRC. I may have removed the /tmp/wmf-whatever dir first? don't remember. Anyway, I did flail with -c Jade for a minute. [13:49:40] heh [13:51:02] (03CR) 10CDanis: [C: 03+2] mock: use unittest.mock and remove 3rd party mock [software/conftool] - 10https://gerrit.wikimedia.org/r/514293 (owner: 10Volans) [13:52:13] (03PS4) 10Muehlenhoff: Remove support for Ubuntu/trusty in base packages [puppet] - 10https://gerrit.wikimedia.org/r/498126 [13:52:18] thcipriani: it works! :D [13:52:26] * zeljkof is cutting the branch [13:52:32] zeljkof: great! 
[13:52:36] * thcipriani back to coffee [13:52:44] please don't reboot anything, even your laptops until I'm done with train ;) [13:53:03] !log restart mtail on lithium [13:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:43] (03CR) 10Muehlenhoff: [C: 03+2] Remove support for Ubuntu/trusty in base packages [puppet] - 10https://gerrit.wikimedia.org/r/498126 (owner: 10Muehlenhoff) [13:53:55] (03Merged) 10jenkins-bot: mock: use unittest.mock and remove 3rd party mock [software/conftool] - 10https://gerrit.wikimedia.org/r/514293 (owner: 10Volans) [13:54:08] 10Operations, 10Kubernetes, 10Patch-For-Review: Decommission darmstadtium - https://phabricator.wikimedia.org/T224562 (10fsero) ` The last Puppet run was at Tue Jun 4 13:43:31 UTC 2019 (9 minutes ago). Debian GNU/Linux 9 auto-installed on Thu May 31 06:28:00 UTC 2018. Last login: Tue Jun 4 13:47:49 2019 f... [13:54:30] (03PS2) 10Fsero: decommision darmstadtium [dns] - 10https://gerrit.wikimedia.org/r/514024 (https://phabricator.wikimedia.org/T224562) [13:56:03] <_joe_> uhm icinga sees blubberoid down akosiaris [13:56:07] <_joe_> anything changed? [13:56:35] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1001.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:58:01] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:59:34] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10ema) @CDanis thank you so much for this! Very useful. Note that the Server response header will be set to `Varnish` for all synthe... 
[14:00:45] (03CR) 10Fsero: [C: 03+2] decommision darmstadtium [dns] - 10https://gerrit.wikimedia.org/r/514024 (https://phabricator.wikimedia.org/T224562) (owner: 10Fsero) [14:01:17] (03PS2) 10Jbond: firewall logging: Enable firewall logging on mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/511708 (https://phabricator.wikimedia.org/T116011) [14:01:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:01:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:02:04] <_joe_> I think that's restbase [14:02:16] <_joe_> but let me check [14:02:41] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:03:03] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:04:07] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:04:29] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:08:51] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:09:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:10:09] (03CR) 10Jbond: [C: 03+2] firewall logging: Enable firewall logging on mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/511708 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [14:10:13] <_joe_> that was a spike of errors on upload [14:12:11] _joe_: im not sure if theses metrics are based on prometheus however i noticed mtail was stuck and restarted it at around 14:00, wonder if this could be an artifact [14:12:27] <_joe_> no I don't think so [14:12:40] <_joe_> and no they come from varnishkafka IIRC [14:12:46] ack ok [14:13:07] _joe_: running `scap prep` [14:13:08] <_joe_> zeljkof: have you started the deployment? [14:13:21] branch cut just finished, finally [14:13:35] I had some trouble with Jade extension being renamed (from JADE) [14:14:30] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10CDanis) Indeed, thanks @ema ! I talked with @fgiunchedi some about this earlier and we tweaked the wording on the Logstash dashboa... 
[14:15:11] (03PS1) 10Ottomata: Use eventgate-main as default EventBus EventService in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514301 (https://phabricator.wikimedia.org/T211248) [14:15:29] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:16:06] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10fsero) [14:16:08] 10Operations, 10Kubernetes: Decommission darmstadtium - https://phabricator.wikimedia.org/T224562 (10fsero) 05Open→03Resolved [14:17:34] (03CR) 10Ppchelko: [C: 03+1] Use eventgate-main as default EventBus EventService in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514301 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [14:18:59] zeljkof: i'd like to merge and deploy a beta config change, it's no-op in prod. [14:19:49] ottomata: can it wait until the train window? [14:19:53] sure [14:20:00] (ends) in 40 minutes, if all goes well [14:20:10] ok, wasn't sure if you were already done. i will wait, thanks [14:20:35] ottomata: running late, just cut the branch, some trouble with that :/ [14:24:07] (03PS1) 10Zfilipin: Group0 to 1.34.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514303 [14:25:29] _joe_: running `scap clean` [14:36:06] (03CR) 10Jhedden: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/514279 (https://phabricator.wikimedia.org/T224743) (owner: 10Arturo Borrero Gonzalez) [14:36:18] !log zfilipin@deploy1001 Pruned MediaWiki: 1.34.0-wmf.3 (duration: 11m 02s) [14:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:42] (03PS3) 10Muehlenhoff: Remove support for trusty/Ubuntu in kernel/sysctl settings [puppet] - 10https://gerrit.wikimedia.org/r/498123 [14:36:51] 10Operations, 10Traffic, 10observability: varnish: add X-Fetch-Error response header - https://phabricator.wikimedia.org/T224994 (10ema) [14:36:56] 10Operations, 10Traffic, 10observability: varnish: add X-Fetch-Error response header - https://phabricator.wikimedia.org/T224994 (10ema) p:05Triage→03Normal [14:37:12] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:38:12] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:39:26] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:39:28] !log zfilipin@deploy1001 Pruned MediaWiki: 1.34.0-wmf.4 [keeping static files] (duration: 01m 34s) [14:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:56] (03CR) 10Muehlenhoff: [C: 03+2] Remove support for trusty/Ubuntu in kernel/sysctl settings [puppet] - 10https://gerrit.wikimedia.org/r/498123 (owner: 10Muehlenhoff) [14:41:15] !log zfilipin@deploy1001 Pruned MediaWiki: 1.34.0-wmf.5 [keeping static files] (duration: 01m 38s) [14:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:30] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:43:07] (03CR) 10BBlack: [C: 03+1] cache_upload: return HTTP 403 to requests violating UA policy [puppet] - 10https://gerrit.wikimedia.org/r/514017 (https://phabricator.wikimedia.org/T224891) (owner: 10Ema) [14:43:21] !log zfilipin@deploy1001 Started scap: testwiki to php-1.34.0-wmf.8 and rebuild l10n cache [14:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:34] _joe_: running `scap sync` [14:43:43] <_joe_> zeljkof: ok thanks [14:43:53] (03CR) 10Andrew Bogott: [C: 03+1] "If the pcc is happy then this looks right to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/514279 (https://phabricator.wikimedia.org/T224743) (owner: 10Arturo Borrero Gonzalez) [14:44:10] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 73797 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:44:36] (03PS2) 10Ema: cache_upload: return HTTP 403 to requests violating UA policy [puppet] - 10https://gerrit.wikimedia.org/r/514017 (https://phabricator.wikimedia.org/T224891) [14:44:42] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10jijiki) [14:45:31] (03CR) 10Ema: [C: 03+2] cache_upload: return HTTP 403 to requests violating UA policy [puppet] - 10https://gerrit.wikimedia.org/r/514017 (https://phabricator.wikimedia.org/T224891) (owner: 10Ema) [14:46:02] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes2002.codfw.wmnet, kubernetes2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:46:14] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes2001.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:46:31] zeljkof: have you finished? [14:46:31] PROBLEM - LVS HTTP IPv4 on blubberoid.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.31 and port 8748: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:46:44] jbond42: not even close :( [14:46:48] ok [14:46:54] could you ping me when done [14:47:06] what's that? [14:47:06] I'm hoping in 10-20 minutes, but hard to tell [14:47:13] jbond42: will do [14:47:16] blubberoid known/expected ? 
[14:47:17] no problem, thanks [14:47:17] blubberoid paged out [14:47:41] i think akosiaris was doing something with that [14:48:01] (03PS1) 10BBlack: cache: reimage cp3035 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/514306 (https://phabricator.wikimedia.org/T222937) [14:48:09] yeah it's known and should not have paged he says [14:48:13] <_joe_> yes, and blubberoid shouldn't page :/ [14:48:24] ack, thanks [14:48:29] yeah ignore it [14:48:35] should recover pretty soon [14:48:39] <_joe_> akosiaris: can we make it non-paging? [14:48:43] <_joe_> :P [14:48:44] yes [14:49:19] <_joe_> zeljkof: has the code been synced to the servers already? [14:49:21] RECOVERY - LVS HTTP IPv4 on blubberoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 7832 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:49:50] _joe_: syncing to testwiki now [14:50:28] (03CR) 10BBlack: [C: 03+2] cache: reimage cp3035 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/514306 (https://phabricator.wikimedia.org/T222937) (owner: 10BBlack) [14:50:52] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:51:03] !log depool cp3035 for ATS reimage - T222937 [14:51:06] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:51:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:09] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [14:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:33] and the kubelet alert this time around is actually correct. It should have alerted. 
With all the mass restarts that are going on [14:51:36] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:51:40] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:51:44] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:51:58] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:52:21] _joe_, moritzm, robh, can one of you set the clinic duty to fsero in the topic? [14:52:37] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3035.esams.wmnet'] ` The log can be found i... [14:52:39] =] [14:53:00] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:54:28] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:54:35] XioNoX: ack, doing that now [14:55:06] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:55:30] (03CR) 10Arturo Borrero Gonzalez: "Here is PCC https://puppet-compiler.wmflabs.org/compiler1001/16864/" [puppet] - 10https://gerrit.wikimedia.org/r/514279 (https://phabricator.wikimedia.org/T224743) (owner: 10Arturo Borrero Gonzalez) [14:55:33] XioNoX: sorry I cannot [14:55:45] cp3035 seems down? [14:55:58] XioNoX: someone was faster, but done :-) [14:56:10] oh, I just saw the reimage [14:59:17] <_joe_> zeljkof: how long before you're done, you think? [14:59:48] PROBLEM - Host kubestagetcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:48] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:55] (03CR) 10Vgutierrez: [C: 03+2] ATS: Provide a unified logs define [puppet] - 10https://gerrit.wikimedia.org/r/510641 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez) [15:00:13] (03PS9) 10Vgutierrez: ATS: Provide a unified logs define [puppet] - 10https://gerrit.wikimedia.org/r/510641 (https://phabricator.wikimedia.org/T221217) [15:00:28] _joe_: scap is at sync-apaches, so a few more minutes, I think [15:00:32] RECOVERY - Host kubestagetcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [15:00:41] ^ will etcds will recover soon [15:01:04] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [15:01:13] (03PS14) 10Elukey: kerberos: add script to generate service principals/keytabs [puppet] - 10https://gerrit.wikimedia.org/r/470566 (https://phabricator.wikimedia.org/T212257) (owner: 10Muehlenhoff) [15:01:18] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 24 not-conn: cp1080_v4, cp1080_v6, cp1084_v4, cp1084_v6, cp1086_v4, cp1086_v6, cp1088_v4, cp1088_v6, cp1090_v4, cp1090_v6, cp2005_v4, cp2005_v6, cp2022_v4, cp2022_v6, cp2024_v4, cp2024_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:01:20] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3035_v4, cp3035_v6 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:01:22] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:01:28] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:01:32] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:01:50] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:01:50] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:01:52] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:01:58] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:01:58] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:02:00] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:02:00] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:02:03] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:02:10] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 40 
connecting: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:02:16] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:02:16] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:02:16] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:02:18] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:02:20] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:02:20] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:02:22] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:02:28] ^^ bblack im gussing this is the reimage [15:03:28] hmm, yeah [15:03:30] sorry! [15:03:54] we had one node in this set that had been dead for hardware for a long time, and all the related ipsec alerts were ack'd off from that situation. 
[15:04:23] I removed the hardware-dead node from config the other day, which made all the strongswan alerts recover and re-arm, which then leads to them alerting now on the next reimage in this cluster :P [15:04:49] ahh ok :) [15:04:56] !log failover Ganeti master in eqiad to ganeti1001 [15:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:08] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:05:12] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:05:34] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:05:34] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:05:41] they'll all recover over a ~30 min window a little later when they realize cp3035 isn't part of the ipsec node set anymore [15:05:51] zeljkof: lemme know when i can go [15:05:57] <_joe_> akosiaris: ^^ you might want to check the etcd cluster in eqiad for k8s [15:06:28] (03PS1) 10Ema: Add 0022-deref-objcore-synth-err.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514315 [15:06:32] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:06:39] _joe_, ottomata, jbond42 (not sure if I'm missing somebody): I need more time to finish train [15:06:45] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Return HTTP 403 to requests in violation of User-Agent policy - https://phabricator.wikimedia.org/T224891 (10Legoktm) For Tech News: Bots and other scripts that do not set an identifiable [[https://meta.wikimedia.org/wiki/User-Agent_policy|User-Ag... [15:06:50] current estimate is 10-20 more minutes [15:06:52] np zeljkof just lemme know :) [15:06:53] k! [15:06:54] hoping for 10 [15:06:56] <_joe_> zeljkof: what happened? [15:07:00] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:07:34] _joe_: I had trouble cutting the branch, JADE extension got renamed to Jade, then a tool that cuts the branch had a bug so I was stuck until thcipriani woke up... [15:07:56] _joe_: that's the staging cluster and the ganeti reboots [15:08:09] <_joe_> neon is staging? [15:08:25] <_joe_> right it's kubemaster for prod [15:08:58] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 71.55, 33.73, 20.51 [15:09:36] <_joe_> we're at the rebuild-cdb stage zeljkof?
[15:09:55] _joe_: correct, scap-cdb-rebuild [15:09:59] at 50% [15:10:06] looks like it [15:10:52] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 62.46, 28.91, 19.31 [15:12:03] <_joe_> it's recovering fwiw [15:13:06] RECOVERY - High CPU load on API appserver on mw1277 is OK: OK - load average: 21.67, 27.18, 20.20 [15:13:07] !log zfilipin@deploy1001 Finished scap: testwiki to php-1.34.0-wmf.8 and rebuild l10n cache (duration: 29m 46s) [15:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:25] !log draining ganeti1003 for eventual reboot to MDS-enabled Linux kernel [15:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:44] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:13:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:13:45] success [15:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:53] <_joe_> is it just testwiki? 
[15:14:24] <_joe_> I thought group0 included mediawiki.org [15:14:39] usually they just move testwiki separately to generate the l10n cache etc all in one go [15:14:45] rest of group0 should take a minute or so to move [15:14:51] (03CR) 10jerkins-bot: [V: 04-1] Add 0022-deref-objcore-synth-err.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514315 (owner: 10Ema) [15:14:55] ^ what reedy said [15:15:01] (03CR) 10Zfilipin: [C: 03+2] Group0 to 1.34.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514303 (owner: 10Zfilipin) [15:15:08] it just needs a wikiversions sync, which is that ^ :) [15:15:37] on it [15:15:45] <_joe_> this is good for my fears around php-fpm, btw [15:15:52] waiting for this to merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/514303 [15:16:00] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514303 (owner: 10Zfilipin) [15:16:20] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 36 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:16:24] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514303 (owner: 10Zfilipin) [15:16:26] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 42 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:16:28] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 42 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:16:40] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:16:42] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 42 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:16:44] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 42 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:16:46] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[15:17:08] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:17:10] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:17:18] _joe_: running `scap sync-wikiversions` (not sure if you care about that) [15:17:34] <_joe_> zeljkof: thanks [15:17:46] <_joe_> I do, thank you [15:18:08] <_joe_> I'm looking at https://grafana.wikimedia.org/d/GuHySj3mz/php7-transition?refresh=30s&panelId=5&fullscreen&orgId=1&from=now-1h&to=now to ensure we don't run out of opcache [15:18:14] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.8 [15:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:36] <_joe_> this is the first train deploy since we stopped pushing for an opcache reset [15:19:02] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:19:39] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10wmerrors, and 6 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10herron) [15:19:52] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 42 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:19:54] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:20:04] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:20:07] is the sessionstore one related to any current WIP? 
[15:20:15] no [15:20:30] it's me finishing the deployment after a messup [15:20:42] _joe_, ottomata, jbond42: I'm done, if nothing explodes in the next 5-10 minutes, you can continue with your activities [15:20:46] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/trafficserver/lua/] [15:21:07] zeljkof: thanks [15:21:32] <_joe_> you can see the effects of the train deploy on the available opcache [15:21:41] k! [15:21:46] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:22:04] RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 14.52, 15.11, 16.61 [15:22:04] PROBLEM - Check systemd state on cp3045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:23:15] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 3 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (10MSantos) [15:23:18] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:24:48] 10Operations, 10Traffic, 10User-notice: Return HTTP 403 to requests in violation of User-Agent policy - https://phabricator.wikimedia.org/T224891 (10TheDJ) Not sure if it applies here, but please remember that we allow `Api-User-Agent` as an alternative to `User-Agent` for Javascript solutions. 
[15:25:06] (03CR) 10Ottomata: [C: 03+2] Use eventgate-main as default EventBus EventService in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514301 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [15:25:14] proceeding [15:25:34] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:26:20] (03PS1) 10Alexandros Kosiaris: kask: Actually ship affinity correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/514317 [15:26:23] (03CR) 10jenkins-bot: Use eventgate-main as default EventBus EventService in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514301 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [15:26:42] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] kask: Actually ship affinity correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/514317 (owner: 10Alexandros Kosiaris) [15:27:02] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Use eventgate-main in beta. No-op in prod. T211248 (duration: 00m 49s) [15:27:04] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3035.esams.wmnet'] ` and were **ALL** successful. 
[15:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:08] T211248: Modern Event Platform: Stream Intake Service: Migrate Mediawiki Eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 [15:27:32] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 42 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:29:22] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 42 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:29:24] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:29:24] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:29:24] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:29:32] (03PS6) 10Jbond: monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) [15:30:28] (03PS1) 10Ema: Add 0023-pass-delivery-is-no-err.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514318 [15:30:30] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [15:30:35] (03CR) 10Jbond: "If people could review the module/daemons they are familiar with and let me know if there are more appropriate links that would be appreci" [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [15:31:16] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 42 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:32:31] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10wmerrors, and 6 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10herron) 
Long JSON messages to ELK are being truncated since T187147#5182892 which addresses "Changes to rsyslog/Kafka mean that la... [15:33:05] _joe_, ottomata, jbond42: train looks ok, proceed with your plans, and apologies for the delay [15:34:09] (03CR) 10Cwhite: "> Patch Set 8:" (031 comment) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [15:34:35] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10akosiari... [15:34:41] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) 05Open→03Resolved a:03akosiaris And LVS done today. ` akosiaris@deploy1001:~$ curl -i https://sess... 
[15:34:46] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [15:34:56] (03PS9) 10Cwhite: initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) [15:36:22] PROBLEM - Ensure traffic_manager is running for instance backend on cp3045 is CRITICAL: NRPE: Command check_traffic_manager_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:36:42] (03PS2) 10ArielGlenn: Add zh variants to beta wikivoyage [puppet] - 10https://gerrit.wikimedia.org/r/511145 (https://phabricator.wikimedia.org/T223770) (owner: 10Reedy) [15:37:53] (03CR) 10ArielGlenn: [C: 03+2] Add zh variants to beta wikivoyage [puppet] - 10https://gerrit.wikimedia.org/r/511145 (https://phabricator.wikimedia.org/T223770) (owner: 10Reedy) [15:38:22] PROBLEM - Ensure traffic_server is running for instance backend on cp3045 is CRITICAL: NRPE: Command check_traffic_server_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:38:42] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:38:44] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) >>! In T220401#5233899, @akosiaris wrote: > > And LVS done today. > > ` > akosiaris@deploy1001:~$ curl -i h... 
[15:39:00] (03CR) 10jerkins-bot: [V: 04-1] Add 0023-pass-delivery-is-no-err.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514318 (owner: 10Ema) [15:39:17] fsero: hi, please don't close tasks just because they're inactive [15:39:32] 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: "From" at start of line becomes ">From" in pipermail - https://phabricator.wikimedia.org/T115329 (10Legoktm) 05Declined→03Open Still an issue. [15:39:45] (03CR) 10Vgutierrez: [C: 03+2] ATS: Ensure proper permissions for ATS layouts [puppet] - 10https://gerrit.wikimedia.org/r/512643 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez) [15:40:19] (03CR) 10Vgutierrez: [C: 03+2] ATS: Provide parent proxies support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/511869 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [15:40:20] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp3045 is CRITICAL: NRPE: Command check_trafficserver_exporter_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:40:43] 10Operations, 10Wikimedia-Mailing-lists: wikitech-l is mangling my PGP/MIME emails, causing signature validation to fail - https://phabricator.wikimedia.org/T186311 (10Legoktm) 05Declined→03Open Still an issue as far as I can tell. I'll try sending another PGP/MIME email to wikitech-l later this week to ve... [15:40:45] (03PS7) 10Vgutierrez: ATS: Provide parent proxies support [puppet] - 10https://gerrit.wikimedia.org/r/511869 (https://phabricator.wikimedia.org/T221594) [15:41:03] ema: can you peek at that? cp3045/trafficserver_exporter ? 
[15:41:05] jouncebot [15:41:07] jouncebot: now [15:41:07] No deployments scheduled for the next 0 hour(s) and 18 minute(s) [15:41:10] jouncebot: next [15:41:10] In 0 hour(s) and 18 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190604T1600) [15:41:32] !log reboot cp3035 post-reimage [15:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:38] uh... [15:41:41] cp3045 bblack [15:41:42] bblack: are you sure you want to reboot 3035? [15:41:54] :) [15:42:18] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 41 connections established with conf2001.codfw.wmnet:2379 (min=41) https://wikitech.wikimedia.org/wiki/PyBal [15:42:41] 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: Mailman: Consider hiding real list administrators email addresses - https://phabricator.wikimedia.org/T150164 (10Legoktm) 05Declined→03Open Still an issue. [15:42:46] my reimage was all about 3035 [15:42:56] I don't know what's going on with 3045 other than the alert above [15:43:08] bblack: mmh, no, your reimage was cp3045 looking at the patch [15:43:17] oh [15:43:21] commit log says cp3035 [15:43:25] well that's what I get for multitasking then [15:43:31] I'll go sort it out and clean it up :P [15:43:52] all my open windows were on 3035, and the commit msg is 3035, but yeah commit contents and message didn't match [15:43:59] ack :) [15:44:38] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:44:56] (03PS7) 10Jbond: monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) [15:45:13] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 3 others: Move proton logging to new logging pipeline - 
https://phabricator.wikimedia.org/T219925 (10Tgr) [15:46:00] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [15:46:21] (03CR) 10CDanis: monitoring: add notes url for memory errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [15:46:35] (03PS3) 10Ottomata: [EventBus] Add eventgate-main event service. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510299 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [15:46:38] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:47:16] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:47:16] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [15:47:41] bblack: are the upload@esams avail alerts related to the reimage? [15:48:08] or do we need to investigate? [15:48:18] (03CR) 10Ottomata: [C: 03+2] [EventBus] Add eventgate-main event service. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510299 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [15:48:32] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, 10serviceops, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Tgr) Doesn't that sort of duplicate #jade? 
[15:48:36] (03CR) 10jenkins-bot: [EventBus] Add eventgate-main event service. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510299 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [15:49:44] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:50:34] !log otto@deploy1001 Synchronized wmf-config/CommonSettings.php: Configure eventgate-main EventService. No-op in prod. T211248 (duration: 01m 19s) [15:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:39] T211248: Modern Event Platform: Stream Intake Service: Migrate Mediawiki Eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 [15:51:20] legoktm: Hi, don't see the point to keep a bunch of inactive tasks polluting phab. However if you think it should be still open let's keep it that way :) [15:52:14] fsero: they're still issues, are they not? 
if anything it mostly just identifies the symptom that no one is really maintaining our mailman install [15:52:27] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.7/extensions/Jade: Consistency (duration: 01m 08s) [15:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:33] (03PS2) 10Reedy: Update casing of Jade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514271 [15:52:37] (03CR) 10Reedy: [C: 03+2] Update casing of Jade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514271 (owner: 10Reedy) [15:52:47] (03PS8) 10Jbond: monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) [15:52:57] (03CR) 10Jbond: monitoring: add notes url for memory errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [15:53:35] (03Merged) 10jenkins-bot: Update casing of Jade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514271 (owner: 10Reedy) [15:53:51] (03CR) 10jenkins-bot: Update casing of Jade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514271 (owner: 10Reedy) [15:53:52] PROBLEM - check_trafficserver_log_fifo_notpurge_backend on cp3045 is CRITICAL: CRITICAL: /var/log/trafficserver/notpurge.pipe - does not exist [15:55:01] !log reedy@deploy1001 Synchronized wmf-config/extension-list: JADE - T212182 (duration: 00m 53s) [15:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:06] T212182: Rename JADE->Jade in beta cluster configuration - https://phabricator.wikimedia.org/T212182 [15:55:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:55:10] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: 
Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [15:55:48] PROBLEM - check_trafficserver_log_fifo_purge_backend on cp3045 is CRITICAL: CRITICAL: /var/log/trafficserver/purge.pipe - does not exist [15:56:34] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: JADE - T212182 (duration: 00m 53s) [15:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:44] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp3045 is CRITICAL: connect to address 10.20.0.180 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:59:31] 10Operations, 10Jade, 10Scoring-platform-team, 10TechCom, and 4 others: Deploy Jade extension to production - https://phabricator.wikimedia.org/T183381 (10Reedy) [16:00:04] godog and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190604T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:04:52] 10Operations, 10ops-eqiad: rack/setup 3 new single cpu spare pool systems - https://phabricator.wikimedia.org/T219890 (10Cmjohnson) [16:08:24] !log depool cp3045 for reimage - T222937 [16:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:30] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [16:09:16] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3045.esams.wmnet [16:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:23] ACKNOWLEDGEMENT - Device not healthy -SMART- on restbase-dev1006 is CRITICAL: cluster=restbase_dev device=sdd instance=restbase-dev1006:9100 job=node site=eqiad Ayounsi https://phabricator.wikimedia.org/T223825 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1006&var-datasource=eqiad+prometheus/ops [16:10:32] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3045.esams.wmnet'] ` The log can be found i... 
[16:12:30] (03PS3) 10Muehlenhoff: Remove support for Ubuntu in apt/debmonitor base classes [puppet] - 10https://gerrit.wikimedia.org/r/498134 [16:12:43] (03CR) 10Muehlenhoff: Remove support for Ubuntu in apt/debmonitor base classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498134 (owner: 10Muehlenhoff) [16:12:46] !log starting rolling reboots of mw1* [16:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:10] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [16:13:14] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:20] there will probably be more ipsec alerts [16:13:52] (03CR) 10jerkins-bot: [V: 04-1] Remove support for Ubuntu in apt/debmonitor base classes [puppet] - 10https://gerrit.wikimedia.org/r/498134 (owner: 10Muehlenhoff) [16:14:20] !log repool cp3035 (still varnish-be, but freshly installed!) [16:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:26] (03CR) 10Elukey: ">" (031 comment) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [16:22:05] 10Operations, 10ops-codfw: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10herron) [16:23:20] PROBLEM - Host mw1294 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:27] 10Operations, 10ops-codfw: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10herron) That did the trick! 
All of the new codfw kafka-main hosts are now installed and ready for service setup [16:23:37] 10Operations, 10ops-codfw: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10herron) [16:23:38] RECOVERY - Host mw1294 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:23:48] 10Operations, 10ops-codfw: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10herron) 05Open→03Resolved [16:24:50] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, 10serviceops, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Mholloway) Hmm, I guess they're aimed in a similar way at editor ratification of others' judgments (in the case of JADE,... [16:25:35] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [16:25:37] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:25:38] (03PS1) 10Alexandros Kosiaris: sessionstore: Fix LVS service icinga check [puppet] - 10https://gerrit.wikimedia.org/r/514326 [16:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:40] 10Operations, 10netops: Investigate cr2-eqord's disconnection from the rest of the network - https://phabricator.wikimedia.org/T224535 (10ayounsi) 1/ outstanding alert, I think this is due to the alert being triggered right before a 3 days weekend and people not paying enough attention to active Icinga alerts.... 
[16:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10RobH) [16:28:37] (03PS1) 10Alexandros Kosiaris: sessionstore: Use /openapi, not /?spec for validation [puppet] - 10https://gerrit.wikimedia.org/r/514327 [16:28:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] sessionstore: Fix LVS service icinga check [puppet] - 10https://gerrit.wikimedia.org/r/514326 (owner: 10Alexandros Kosiaris) [16:29:15] (03PS1) 10Cmjohnson: Setting up mgmt ip for wmf5177/wmf5178 [dns] - 10https://gerrit.wikimedia.org/r/514328 [16:29:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10RobH) [16:29:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] sessionstore: Use /openapi, not /?spec for validation [puppet] - 10https://gerrit.wikimedia.org/r/514327 (owner: 10Alexandros Kosiaris) [16:30:19] (03CR) 10Volans: wdqs: add WDQS restart cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [16:32:45] !log Compress some more tables on labsdb1012 before upgrading the host tomorrow T222978 [16:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:59] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 [16:33:01] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [16:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:06] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [16:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:11] !log robh@cumin1001 START - Cookbook 
sre.hosts.decommission [16:33:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labstore1001.eqiad.wmnet` -... [16:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:17] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [16:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labstore1002.eqiad.wmnet` -... [16:33:27] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [16:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:33] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [16:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labstore1003.eqiad.wmnet` -... [16:35:40] RECOVERY - LVS HTTP IPv4 on sessionstore.svc.eqiad.wmnet is OK: HTTP OK: Status line output matched 200 - 126 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:38:38] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:39:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10RobH) Switch port info: labstore1001:asw2-c-eqiad:ge-2/0/15 labstore1002:asw2-c-eqiad:ge-3/0/5 labstore1003:asw2-a-e... [16:40:42] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [16:40:45] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:40:45] (03PS1) 10Alexandros Kosiaris: Remove duplicate /openapi from sessionstore OpenAPI checks [puppet] - 10https://gerrit.wikimedia.org/r/514330 [16:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove duplicate /openapi from sessionstore OpenAPI checks [puppet] - 10https://gerrit.wikimedia.org/r/514330 (owner: 10Alexandros Kosiaris) [16:47:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "We can merge this now!" 
[puppet] - 10https://gerrit.wikimedia.org/r/505886 (https://phabricator.wikimedia.org/T140100) (owner: 10Muehlenhoff) [16:48:11] (03PS1) 10Ayounsi: Network monitoring, make core router down page [puppet] - 10https://gerrit.wikimedia.org/r/514332 (https://phabricator.wikimedia.org/T224535) [16:50:39] (03PS1) 10RobH: remove labstore100[123] repo entries [puppet] - 10https://gerrit.wikimedia.org/r/514343 (https://phabricator.wikimedia.org/T187456) [16:50:43] PROBLEM - Host mw1233 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:43] PROBLEM - Host mw1231 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:43] PROBLEM - Host mw1232 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:45] PROBLEM - Host mw1234 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:49] RECOVERY - Host mw1232 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:50:51] PROBLEM - Host mw1300 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:53] RECOVERY - Host mw1234 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [16:50:53] RECOVERY - Host mw1233 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:50:53] RECOVERY - Host mw1231 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:51:29] do we have any idea where did ^ come from ? [16:53:10] nope! they are in row d and i just pushed a switch config change to row a and c (disabling labstore100[1-3]) [16:53:45] they are all in d5 [16:53:55] this is weird [16:53:56] I think jbond42 was restarting mw1 hosts? [16:54:06] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [16:54:07] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:24] ah, tx marostegui [16:54:29] sorry that was me, expired down time [16:55:50] no worries, we're just a paranoid lot! 
[16:56:06] (i consider paranoia a positive) [16:56:59] (03PS10) 10Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) [16:57:01] (03PS3) 10Mathew.onipe: add WDQS reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) [16:57:20] (03CR) 10RobH: [C: 03+2] remove labstore100[123] repo entries [puppet] - 10https://gerrit.wikimedia.org/r/514343 (https://phabricator.wikimedia.org/T187456) (owner: 10RobH) [16:57:27] yes a little bit of paranoia doesn't hurt [16:57:51] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:57:51] also it isn't paranoia if they are after you! [16:58:06] !log deleted some gerrit changes [16:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:22] (03CR) 10Mathew.onipe: wdqs: add WDQS restart cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [16:58:24] even if you are actually paranoid they might still be after you! [16:59:00] the mind boggles [16:59:57] (03PS1) 10RobH: decom labstore100[1-3] prod dns [dns] - 10https://gerrit.wikimedia.org/r/514350 (https://phabricator.wikimedia.org/T187456) [17:00:04] cscott, arlolra, subbu, and halfak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190604T1700). 
[17:00:56] (03PS2) 10Ayounsi: Network monitoring, make core router down page [puppet] - 10https://gerrit.wikimedia.org/r/514332 (https://phabricator.wikimedia.org/T224535) [17:00:58] no parsoid deploy today [17:01:09] 10Operations, 10User-herron: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (10herron) p:05Triage→03Normal [17:01:50] (03CR) 10RobH: [C: 03+2] decom labstore100[1-3] prod dns [dns] - 10https://gerrit.wikimedia.org/r/514350 (https://phabricator.wikimedia.org/T187456) (owner: 10RobH) [17:03:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10RobH) [17:03:36] (03PS1) 10Kosta Harlan: PageTriage: Log debug level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514353 [17:04:09] PROBLEM - Host mw1239 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10RobH) [17:04:41] 10Operations, 10User-herron: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (10herron) Fwiw kafka2001 is the current controller so thinking we should start with kafka2003 -> kafka-main2003 [17:05:27] RECOVERY - Host mw1239 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [17:07:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10RobH) a:05RobH→03Cmjohnson For some reason this lacked the #decommission tag and I didn't know about it until @MoritzMuehlenhoff pinged... [17:09:57] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:10:16] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/16868/" [puppet] - 10https://gerrit.wikimedia.org/r/514332 (https://phabricator.wikimedia.org/T224535) (owner: 10Ayounsi) [17:10:31] PROBLEM - Host mw1302 is DOWN: PING CRITICAL - Packet loss = 100% [17:10:31] PROBLEM - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100% [17:10:59] 10Operations, 10ops-eqiad, 10DBA, 10Goal, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) 05Open→03Stalled [17:12:17] RECOVERY - Host mw1299 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:17:27] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3045.esams.wmnet'] ` and were **ALL** successful. [17:17:39] RECOVERY - Host mw1302 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:18:21] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Andrew) raid config: - raid1 for the two os volumes - one big raid6 w/lvm for the remaining internal drives - another big raid6 w/lvm for... 
[17:21:01] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:21:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:14] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:27:16] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:59] 10Operations, 10ops-codfw: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10herron) Tracking service implementation in T225005 [17:33:44] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:33:45] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:04] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:37:05] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:55] PROBLEM - Nginx local proxy to videoscaler on mw1302 is CRITICAL: connect to address 10.64.16.67 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [17:38:09] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:38:10] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:39:33] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:40:22] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [17:40:39] RECOVERY - Nginx local proxy to videoscaler on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [17:43:49] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, 10serviceops, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Tgr) > it looks like JADE is focused specifically on revisions, and I wouldn't expect the machine vision-derived labels t... [17:45:39] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, 10serviceops, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Tgr) >>! In T224917#5234081, @Mholloway wrote: > We'd still need to interface with the third-party machine vision provide... 
[17:48:01] (03PS2) 10BryanDavis: toolserver: redirect ~nikola/svgtranslate.php to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/512341 (https://phabricator.wikimedia.org/T224265) (owner: 10Aklapper) [17:48:02] (03PS1) 10BryanDavis: toolserver: redirect /tiles to https://tiles.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/514360 [17:49:24] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [17:49:26] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:01] 10Operations, 10SRE-Access-Requests: Requesting access to Logstash for Cstone - https://phabricator.wikimedia.org/T225010 (10Cstone) [18:02:13] 10Operations, 10ops-eqiad, 10decommission: decommission thulium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T203520 (10RobH) [18:02:20] (03PS1) 10Herron: kafka-main: replace kafka2003 hardware with kafka-main2003 [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) [18:02:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T222109 (10RobH) [18:04:14] !log pool cp3045 - T222937 [18:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:19] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [18:05:35] PROBLEM - High CPU load on API appserver on mw1314 is CRITICAL: connect to address 10.64.16.195 port 5666: Connection refused [18:05:41] PROBLEM - High CPU load on API appserver on mw1315 is CRITICAL: connect to address 10.64.16.196 port 5666: Connection refused [18:05:45] PROBLEM - Apache HTTP on mw1313 is CRITICAL: connect to address 10.64.16.194 and port 80: Connection refused 
https://wikitech.wikimedia.org/wiki/Application_servers [18:05:53] PROBLEM - High CPU load on API appserver on mw1316 is CRITICAL: connect to address 10.64.16.197 port 5666: Connection refused [18:05:58] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [18:06:01] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:14] * jbond42 sorry forgot the downtime [18:09:57] RECOVERY - High CPU load on API appserver on mw1314 is OK: OK - load average: 2.61, 1.12, 0.41 [18:09:59] !log performing rolling reboots of eqiad logstash hardware hosts for MDS security updates [18:10:03] RECOVERY - High CPU load on API appserver on mw1315 is OK: OK - load average: 2.71, 1.21, 0.45 [18:10:05] RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:13] RECOVERY - High CPU load on API appserver on mw1316 is OK: OK - load average: 1.41, 0.62, 0.23 [18:10:39] !log correction — performing rolling reboots of codfw logstash hardware hosts for MDS security updates [18:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:45] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [18:17:46] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:03] (03PS3) 10Andrew Bogott: toolserver: redirect ~nikola/svgtranslate.php to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/512341 (https://phabricator.wikimedia.org/T224265) (owner: 10Aklapper) 
[18:21:08] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 53.81, 31.67, 23.38 [18:28:26] PROBLEM - Nginx local proxy to apache on mw1227 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:28:32] PROBLEM - Apache HTTP on mw1227 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:28:40] ^^ looking [18:28:48] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 1.26, 18.87, 22.95 [18:29:38] RECOVERY - Nginx local proxy to apache on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.260 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:29:42] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:35:45] (03PS1) 10Ottomata: Allow configuring eventschemas service port via hiera [puppet] - 10https://gerrit.wikimedia.org/r/514366 [18:36:47] (03CR) 10jerkins-bot: [V: 04-1] Allow configuring eventschemas service port via hiera [puppet] - 10https://gerrit.wikimedia.org/r/514366 (owner: 10Ottomata) [18:36:48] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: connect to address 10.64.32.53 port 5666: Connection refused [18:36:50] PROBLEM - High CPU load on API appserver on mw1343 is CRITICAL: connect to address 10.64.32.55 port 5666: Connection refused [18:37:06] PROBLEM - High CPU load on API appserver on mw1342 is CRITICAL: connect to address 10.64.32.54 port 5666: Connection refused [18:37:10] PROBLEM - High CPU load on API appserver on mw1340 is CRITICAL: connect to address 10.64.32.52 port 5666: Connection refused [18:38:14] (03PS2) 10Ottomata: Allow configuring eventschemas service port via hiera 
[puppet] - 10https://gerrit.wikimedia.org/r/514366 [18:38:52] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [18:38:54] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:42] (03CR) 10Ottomata: [C: 03+2] Allow configuring eventschemas service port via hiera [puppet] - 10https://gerrit.wikimedia.org/r/514366 (owner: 10Ottomata) [18:40:56] RECOVERY - High CPU load on API appserver on mw1341 is OK: OK - load average: 3.90, 1.21, 0.42 [18:41:00] RECOVERY - High CPU load on API appserver on mw1343 is OK: OK - load average: 3.08, 0.91, 0.31 [18:41:18] RECOVERY - High CPU load on API appserver on mw1342 is OK: OK - load average: 2.49, 0.86, 0.30 [18:41:24] RECOVERY - High CPU load on API appserver on mw1340 is OK: OK - load average: 4.68, 1.62, 0.58 [18:42:23] jbond42: are those you, the connection refused on appservers? 
[18:42:43] apergos: yes sorry missed the downtime again [18:43:00] ok, it didn't page but it did make me wonder [18:43:25] almost down for the evening :) [18:45:16] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [18:45:18] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:02] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 74.42, 39.24, 27.21 [18:47:14] ^^ looking [18:47:50] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 67.99, 37.94, 27.10 [18:48:41] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Wikimedia-Apache-configuration: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10fsero) Hi @Tgr :) I'm following this up, according to https://github.com/matrix-org/synapse/blob/master/docs/fede... [18:49:28] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 55.72, 34.93, 23.95 [18:51:02] _joe_ jijiki: I have started to get a few servers coming back from a reboot with hhvm taking a lot of CPU, any insight? 
[18:51:38] (03CR) 10Andrew Bogott: [C: 03+2] toolserver: redirect ~nikola/svgtranslate.php to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/512341 (https://phabricator.wikimedia.org/T224265) (owner: 10Aklapper) [18:52:30] (03CR) 10Jhedden: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/512341 (https://phabricator.wikimedia.org/T224265) (owner: 10Aklapper) [18:52:35] 10Operations, 10Diffusion, 10Release-Engineering-Team (Kanban): Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10greg) [18:53:56] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:54:06] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:54:11] (03PS7) 10Herron: rsyslog: add netdev_kafka_relay compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/495980 (https://phabricator.wikimedia.org/T224128) [18:59:21] (03PS1) 10Ottomata: profile::eventschemas::service - allow server_alias to be configured via hiera [puppet] - 10https://gerrit.wikimedia.org/r/514373 [19:00:48] (03CR) 10Ottomata: [C: 03+2] profile::eventschemas::service - allow server_alias to be configured via hiera [puppet] - 10https://gerrit.wikimedia.org/r/514373 (owner: 10Ottomata) [19:02:12] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 18.43, 20.40, 23.84 [19:02:30] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [19:02:40] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [19:02:58] (03CR) 10Herron: [C: 03+2] rsyslog: add netdev_kafka_relay compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/495980 (https://phabricator.wikimedia.org/T224128) (owner: 10Herron) [19:03:07] (03PS8) 10Herron: rsyslog: add netdev_kafka_relay compatibility endpoint [puppet] - 10https://gerrit.wikimedia.org/r/495980 (https://phabricator.wikimedia.org/T224128) [19:07:10] (03PS1) 10Ottomata: Remove type of $port; it doesn't work via horizon hiera interface [puppet] - 10https://gerrit.wikimedia.org/r/514375 [19:07:34] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 20.91, 20.91, 23.75 [19:08:03] (03CR) 10Ottomata: [C: 03+2] Remove type of $port; it doesn't work via horizon hiera interface [puppet] - 10https://gerrit.wikimedia.org/r/514375 (owner: 10Ottomata) [19:08:14] (03PS2) 10Ottomata: Remove type of $port; it doesn't work via horizon hiera interface [puppet] - 10https://gerrit.wikimedia.org/r/514375 [19:08:18] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove type of $port; it doesn't work via horizon hiera interface [puppet] - 10https://gerrit.wikimedia.org/r/514375 (owner: 10Ottomata) [19:09:34] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[19:10:35] ^that’s me [19:15:12] (03PS1) 10Herron: rsyslog: remove syslog json template from netdev_kafka_relay [puppet] - 10https://gerrit.wikimedia.org/r/514376 (https://phabricator.wikimedia.org/T224128) [19:17:10] (03CR) 10Herron: [C: 03+2] rsyslog: remove syslog json template from netdev_kafka_relay [puppet] - 10https://gerrit.wikimedia.org/r/514376 (https://phabricator.wikimedia.org/T224128) (owner: 10Herron) [19:19:16] PROBLEM - puppet last run on wezen is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [19:19:43] ^ also me, should clear in a moment [19:20:22] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:20:56] PROBLEM - puppet last run on centrallog1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [19:21:20] PROBLEM - HHVM jobrunner on mw1310 is CRITICAL: connect to address 10.64.0.172 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [19:21:34] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [19:21:36] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:42] ^^ me [19:21:52] PROBLEM - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 8.473e+05 ge 2.592e+05 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [19:24:42] RECOVERY - puppet last run on wezen is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:25:36] RECOVERY - HHVM jobrunner on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [19:26:22] RECOVERY - 
puppet last run on centrallog1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:28:14] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [19:28:17] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:36] (03PS1) 10Volans: Remove reminiscences of Python 2 [software/conftool] - 10https://gerrit.wikimedia.org/r/514378 [19:28:38] (03PS1) 10Volans: dbconfig: exit with proper exit code [software/conftool] - 10https://gerrit.wikimedia.org/r/514379 [19:31:13] (03CR) 10jerkins-bot: [V: 04-1] Remove reminiscences of Python 2 [software/conftool] - 10https://gerrit.wikimedia.org/r/514378 (owner: 10Volans) [19:31:15] (03CR) 10jerkins-bot: [V: 04-1] dbconfig: exit with proper exit code [software/conftool] - 10https://gerrit.wikimedia.org/r/514379 (owner: 10Volans) [19:33:55] (03PS2) 10Volans: Remove reminiscences of Python 2 [software/conftool] - 10https://gerrit.wikimedia.org/r/514378 [19:33:57] (03PS2) 10Volans: dbconfig: exit with proper exit code [software/conftool] - 10https://gerrit.wikimedia.org/r/514379 [19:35:08] 10Operations, 10Wikimedia-Logstash, 10netops, 10Patch-For-Review, 10User-herron: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 (10herron) A syslog UDP listener on port 10514 is now running on lithium/wezen, and forwarding messages received to the Kaf... 
[19:35:45] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [19:35:47] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:36] !log reboot mwdebug1001 [19:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:40] PROBLEM - IPMI Sensor Status on cp3035 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [19:40:09] (03CR) 10CDanis: [C: 03+2] dbconfig: exit with proper exit code [software/conftool] - 10https://gerrit.wikimedia.org/r/514379 (owner: 10Volans) [19:40:49] (03CR) 10CDanis: [C: 03+2] Remove reminiscences of Python 2 [software/conftool] - 10https://gerrit.wikimedia.org/r/514378 (owner: 10Volans) [19:41:02] 10Operations, 10Wikimedia-Logstash, 10netops, 10Patch-For-Review, 10User-herron: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 (10herron) >>! In T224128#5234630, @herron wrote: > Before moving production logs to this I think we should decide on some... [19:41:28] !log reboot mwdebug1002 [19:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:46] ACKNOWLEDGEMENT - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 8.484e+05 ge 2.592e+05 Mathew.onipe We know. It's expected. https://phabricator.wikimedia.org/T224874 - The acknowledgement expires at: 2019-06-10 19:39:07. 
https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [19:41:49] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [19:41:51] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:22] (03Merged) 10jenkins-bot: Remove reminiscences of Python 2 [software/conftool] - 10https://gerrit.wikimedia.org/r/514378 (owner: 10Volans) [19:43:26] (03Merged) 10jenkins-bot: dbconfig: exit with proper exit code [software/conftool] - 10https://gerrit.wikimedia.org/r/514379 (owner: 10Volans) [19:48:08] !log replace logstash.svc.eqiad.wmnet syslog target with syslog.codfw.wmnet on cr4-ulsfo - T224128 [19:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:15] T224128: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 [19:48:26] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [19:48:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:29] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [19:55:31] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:58] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [20:03:00] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:41] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10pmiazga) @Niedzielski noticed another, pretty similar issue: {T225018}. System fails with `PHP Fatal error... [20:06:56] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 49.54, 34.29, 28.61 [20:10:50] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [20:10:52] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:35] 10Operations, 10Page-Previews, 10Readers-Web-Backlog, 10Wikimedia-production-error: [Bug] TypeError in PopupsContext - https://phabricator.wikimedia.org/T225018 (10pmiazga) [20:20:59] PROBLEM - Host mw1319 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:31] looking ^^ [20:24:45] RECOVERY - Host mw1319 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [20:25:26] ^^ got stuck on reboot [20:26:02] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [20:26:04] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:20] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10Papaul) [20:36:05] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [20:36:06] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:59] (03PS1) 10CDanis: dbctl: subcommands for 'section' and 'config' are required [software/conftool] - 10https://gerrit.wikimedia.org/r/514392 [20:41:53] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 19.40, 21.43, 23.80 [20:50:59] (03CR) 10Volans: [C: 03+2] dbctl: subcommands for 'section' and 'config' are required [software/conftool] - 10https://gerrit.wikimedia.org/r/514392 (owner: 10CDanis) [20:55:22] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: connect to address 10.64.16.105 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [20:55:27] (03Merged) 10jenkins-bot: dbctl: subcommands for 'section' and 'config' are required [software/conftool] - 10https://gerrit.wikimedia.org/r/514392 (owner: 10CDanis) [20:55:43] ^^ me sorry [20:55:44] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:55:59] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [20:56:00] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:12] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[20:57:28] (03PS4) 10Muehlenhoff: Remove support for Ubuntu in apt/debmonitor base classes [puppet] - 10https://gerrit.wikimedia.org/r/498134 [20:57:58] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:58:54] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [21:03:42] RECOVERY - Host mw1300 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [21:04:27] (03PS1) 10Muehlenhoff: Extend access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/514393 [21:07:50] !log finished tolling reboots of mw1* servers [21:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:03] !log finished rolling reboots of mw1* servers [21:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:34] (03PS1) 10Jforrester: modules/varnish/templates/text-frontend.inc.vcl.erb: Fix doc reference to renamed variable [puppet] - 10https://gerrit.wikimedia.org/r/514394 [21:10:56] (03PS1) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [21:11:10] (03CR) 10jerkins-bot: [V: 04-1] modules/varnish/templates/text-frontend.inc.vcl.erb: Fix doc reference to renamed variable [puppet] - 10https://gerrit.wikimedia.org/r/514394 (owner: 10Jforrester) [21:11:47] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [21:12:22] (03CR) 10Legoktm: [C: 03+1] "LGTM, but I don't have +2 rights here." 
(031 comment) [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 (owner: 10MarkAHershberger) [21:14:01] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for nathante [puppet] - 10https://gerrit.wikimedia.org/r/514393 (owner: 10Muehlenhoff) [21:29:16] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:36:33] 10Operations, 10Wikimedia-Logstash, 10netops, 10User-herron: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 (10ayounsi) a:03ayounsi Test was successful, next step is to do the change to all devices. Note that this would be a great use of anycast. Onl... [21:50:46] (03PS2) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [21:51:43] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [21:58:43] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10Papaul) [21:59:06] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [22:17:24] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [22:18:14] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul) Both MD's are racked and Netbox updated. 
[22:23:55] 10Operations, 10ops-esams, 10Traffic: cp3035 PS Redundancy Lost - https://phabricator.wikimedia.org/T225035 (10ayounsi) p:05Triage→03High [22:24:56] (03CR) 10Ori.livneh: "Ping" [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [22:25:12] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3035 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi https://phabricator.wikimedia.org/T225035 [22:25:31] (03CR) 10Samwilson: [C: 03+1] toolserver: redirect ~nikola/svgtranslate.php to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/512341 (https://phabricator.wikimedia.org/T224265) (owner: 10Aklapper) [22:26:08] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:28:54] (03PS3) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [22:29:05] 10Operations, 10ops-esams: Degraded RAID on cp3041 - https://phabricator.wikimedia.org/T220193 (10ayounsi) 05Open→03Invalid This seems to be due to a transient connectivity issue, and not a RAID issue. [22:29:40] 10Operations, 10ops-esams: Degraded RAID on cp3034 - https://phabricator.wikimedia.org/T220194 (10ayounsi) 05Open→03Invalid This seems to be due to a transient connectivity issue, and not a RAID issue. [22:29:54] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [22:32:20] fsero: regarding the Prometheus exported, I already told akosiaris about it. 
I actually asked the original author to publish the code :) [22:32:25] *exporter [22:38:36] 10Operations, 10ops-esams, 10Traffic: cp3035 PS Redundancy Lost - https://phabricator.wikimedia.org/T225035 (10RobH) a:03wiki_willy This system is no longer under warranty. This is unlikely, but still could possibly be, due to the power cable becoming unseated. Since no one ever opens the ESAMS racks tho... [22:40:37] I assume someone is aware of the slowness with the CI pipeline? This patch was +2'd and it took 22 hours before the gate-and-submit jobs completed, and it failed because of something unrelated :( https://gerrit.wikimedia.org/r/c/mediawiki/core/+/514105 [22:44:24] RECOVERY - puppet last run on ganeti1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [22:45:49] musikanimal: that question/statement might get more attention in #wikimedia-releng [22:46:07] yup! I was just sent there by someone else. Ty [22:48:53] (03PS1) 10BBlack: cache: move dead node cp3037 to upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/514406 (https://phabricator.wikimedia.org/T222041) [22:48:55] (03PS1) 10BBlack: cache: reimage cp3043 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/514407 (https://phabricator.wikimedia.org/T222937) [22:56:53] (03PS4) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [22:57:33] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [22:59:52] (03PS5) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor My software never has bugs. It just develops random features. 
Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190604T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:18] I'll SWAT [23:00:43] (03CR) 10Catrope: [C: 03+2] PageTriage: Log debug level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514353 (owner: 10Kosta Harlan) [23:00:50] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [23:01:55] (03Merged) 10jenkins-bot: PageTriage: Log debug level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514353 (owner: 10Kosta Harlan) [23:02:11] (03CR) 10jenkins-bot: PageTriage: Log debug level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514353 (owner: 10Kosta Harlan) [23:03:38] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Change log level to debug for PageTriage (duration: 01m 03s) [23:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:15] And done [23:05:53] (03PS6) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [23:07:05] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [23:09:48] (03PS7) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [23:10:47] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. 
[puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [23:12:42] (03PS8) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [23:47:58] (03PS1) 10Alaa Sarhan: Add new terms normalized schema tables as public 1:1 views in labs. [puppet] - 10https://gerrit.wikimedia.org/r/514411 (https://phabricator.wikimedia.org/T225038) [23:53:56] (03CR) 10Alaa Sarhan: [C: 04-1] "DNM - this should not be merged before those tables exist in production." [puppet] - 10https://gerrit.wikimedia.org/r/514411 (https://phabricator.wikimedia.org/T225038) (owner: 10Alaa Sarhan)