[00:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191203T0000).
[00:00:04] Ammarpad and Krinkle: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:09] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:01:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[00:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:01:27] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:02:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[00:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:37] (PS1) Dzahn: fix comment about location of VM gerrit1002 [dns] - https://gerrit.wikimedia.org/r/554193
[00:04:20] (CR) Dzahn: [C: +2] fix comment about location of VM gerrit1002 [dns] - https://gerrit.wikimedia.org/r/554193 (owner: Dzahn)
[00:05:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[00:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:45] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[00:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:09] RECOVERY - mediawiki-installation DSH group on mw2251 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[00:08:54] (PS1) BBlack: Bump to 0.9, depend on pdns_recursor service [debs/prometheus-pdns-rec-exporter] - https://gerrit.wikimedia.org/r/554194 (https://phabricator.wikimedia.org/T239667)
[00:09:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:53] (PS1) BBlack: pdns-rec-exporter should depend on pdns-recursor [puppet] - https://gerrit.wikimedia.org/r/554196 (https://phabricator.wikimedia.org/T239667)
[00:11:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:54] (CR) BBlack: [C: +2] pdns-rec-exporter should depend on pdns-recursor [puppet] - https://gerrit.wikimedia.org/r/554196 (https://phabricator.wikimedia.org/T239667) (owner: BBlack)
[00:19:55] Operations, SRE-Access-Requests, WMF-Legal: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (colewhite) The necessary changes have been deployed. Please let me know if you encounter any related issue.
[00:20:03] Operations, SRE-Access-Requests, WMF-Legal: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (colewhite) Open→Resolved
[00:20:39] (CR) BBlack: [C: +2] Bump to 0.9, depend on pdns_recursor service [debs/prometheus-pdns-rec-exporter] - https://gerrit.wikimedia.org/r/554194 (https://phabricator.wikimedia.org/T239667) (owner: BBlack)
[00:25:43] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Echo/includes/DiscussionParser.php: T239275 Fix type hint fatal from getUserLinks() (duration: 01m 16s)
[00:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:49] T239275: Deferred update EchoHooks::onPageContentSaveComplete failed: Argument passed to generateMentionEvents() must be array - https://phabricator.wikimedia.org/T239275
[00:27:35] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1651.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:28:04] (PS1) BBlack: require, not after [puppet] - https://gerrit.wikimedia.org/r/554197 (https://phabricator.wikimedia.org/T239667)
[00:28:21] (CR) BBlack: [V: +2 C: +2] require, not after [puppet] - https://gerrit.wikimedia.org/r/554197 (https://phabricator.wikimedia.org/T239667) (owner: BBlack)
[00:28:24] Operations, SRE-Access-Requests: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (colewhite)
[00:30:42] Operations, SRE-Access-Requests: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (colewhite) Hi Maryum! I am happy to get started on this for you. There are a few things we'll need to proceed. I've added the checklist to the description.
[00:30:53] Operations, SRE-Access-Requests: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (colewhite) p:Triage→Normal
[00:36:15] (PS1) Bstorm: toolforge calico: Can no longer copy binaries like this [puppet] - https://gerrit.wikimedia.org/r/554198
[00:41:16] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2250.codfw.wmnet'] ` and were **ALL** successful.
[00:45:22] (CR) Bstorm: "https://puppet-compiler.wmflabs.org/compiler1002/19736/tools-k8s-control-1.tools.eqiad.wmflabs/" [puppet] - https://gerrit.wikimedia.org/r/554198 (owner: Bstorm)
[00:45:43] (PS2) Bstorm: toolforge calico: Can no longer copy binaries like this [puppet] - https://gerrit.wikimedia.org/r/554198
[00:48:00] (PS5) Jforrester: Enable Translate extension on sewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/552870 (https://phabricator.wikimedia.org/T239091) (owner: Ammarpad)
[00:48:04] (CR) Jforrester: [C: +2] Enable Translate extension on sewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/552870 (https://phabricator.wikimedia.org/T239091) (owner: Ammarpad)
[00:48:28] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2252.codfw.wmnet'] ` and were **ALL** successful.
[00:48:31] (CR) Bstorm: [C: +2] toolforge calico: Can no longer copy binaries like this [puppet] - https://gerrit.wikimedia.org/r/554198 (owner: Bstorm)
[00:48:59] (Merged) jenkins-bot: Enable Translate extension on sewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/552870 (https://phabricator.wikimedia.org/T239091) (owner: Ammarpad)
[00:49:01] Operations, ops-eqiad, Analytics, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (Jclark-ctr) @elukey Received cards please message me on irc and we can start scheduling replacement
[00:49:07] (PS3) Jforrester: Add sewikimedia to wikidataclient [mediawiki-config] - https://gerrit.wikimedia.org/r/553449 (https://phabricator.wikimedia.org/T239318) (owner: Urbanecm)
[00:49:12] (CR) Jforrester: [C: +2] Add sewikimedia to wikidataclient [mediawiki-config] - https://gerrit.wikimedia.org/r/553449 (https://phabricator.wikimedia.org/T239318) (owner: Urbanecm)
[00:50:00] (Merged) jenkins-bot: Add sewikimedia to wikidataclient [mediawiki-config] - https://gerrit.wikimedia.org/r/553449 (https://phabricator.wikimedia.org/T239318) (owner: Urbanecm)
[00:54:57] !log mwscript sql.php --wiki=sewikimedia php-1.35.0-wmf.5/extensions/Wikibase/client/sql/entity_usage.sql
[00:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:30] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T239091 Enable Translate extension on sewikimedia (duration: 00m 57s)
[00:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:35] T239091: Enable Extension:Translate on se.wikimedia.org - https://phabricator.wikimedia.org/T239091
[00:57:30] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2254.codfw.wmnet'] ` and were **ALL** successful.
[01:00:04] !log mwscript sql.php --wiki=sewikimedia php-1.35.0-wmf.8/extensions/Translate/sql/translate_{…}.sql T239091
[01:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:47] (PS6) Andrew Bogott: Neutron l3_agent: remove external_network_bridge config option [puppet] - https://gerrit.wikimedia.org/r/553883
[01:01:49] (PS1) Andrew Bogott: Openstack Keystone: update the password_whitelist auth plugin [puppet] - https://gerrit.wikimedia.org/r/554199 (https://phabricator.wikimedia.org/T237749)
[01:02:44] (PS1) Jforrester: Revert "Enable Translate extension on sewikimedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/554200
[01:02:49] (CR) Jforrester: [C: +2] Revert "Enable Translate extension on sewikimedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/554200 (owner: Jforrester)
[01:03:05] (CR) Andrew Bogott: [C: +2] Openstack Keystone: update the password_whitelist auth plugin [puppet] - https://gerrit.wikimedia.org/r/554199 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[01:03:52] (Merged) jenkins-bot: Revert "Enable Translate extension on sewikimedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/554200 (owner: Jforrester)
[01:05:31] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T239091 Revert 'Enable Translate extension on sewikimedia' (duration: 01m 01s)
[01:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:37] T239091: Enable Extension:Translate on se.wikimedia.org - https://phabricator.wikimedia.org/T239091
[01:08:28] Operations, ops-eqiad, Analytics, hardware-requests, User-Elukey: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (Jclark-ctr)
[01:14:57] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2253.codfw.wmnet'] ` and were **ALL** successful.
[01:15:58] !log jforrester@deploy1001 Synchronized dblists/wikidataclient.dblist: T239318 Add sewikimedia to wikidataclient (duration: 01m 03s)
[01:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:04] T239318: Enable Extension:Wikibase Client on se.wikimedia.org - https://phabricator.wikimedia.org/T239318
[01:20:57] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.5/includes/diff/DifferenceEngine.php: T236320 Don't calculate amount of inbetween revisions for MCR undo (duration: 00m 59s)
[01:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:02] T236320: Internal error while undoing file captions: "Unsaved revision passed" - https://phabricator.wikimedia.org/T236320
[01:22:25] !log mw2254 - rebooting (reimage script exited with segfault after reimage was done)
[01:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:02] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2253.codfw.wmnet
[01:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:38] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2254.codfw.wmnet
[01:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:24] !log mw2252 rebooting
[01:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:40] !log mw2250 - E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.
[01:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:14] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2250.codfw.wmnet
[01:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:42] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2252.codfw.wmnet
[01:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:02] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): cloudstore1008 crash - Memory error - https://phabricator.wikimedia.org/T239569 (Jclark-ctr) Confirmed: Service Request 1004932600 was successfully submitted.
[01:42:14] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[01:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:42:30] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[01:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:47:49] Operations, Gerrit, vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (Dzahn) Tried to create it but unfortunately: ` Failure: prerequisites not met for this operation: error type: insufficient_resources, error details: Can't compute nodes using iallocator 'h...
[01:55:07] (PS1) Jforrester: [Retry] Enable Translate extension on sewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/554208 (https://phabricator.wikimedia.org/T239091)
[01:55:17] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/VisualEditor/: T239209 Sanitize HTML on paste (duration: 01m 24s)
[01:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:58:40] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, Traffic: https://releases-jenkins.wikimedia.org yields a 502 unrecheable - https://phabricator.wikimedia.org/T239629 (Dzahn) a:Dzahn
[01:58:57] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/VisualEditor/: T239209 Sanitize HTML on paste (duration: 01m 33s)
[01:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:01:23] PROBLEM - PHP7 rendering on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:01:27] PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:01:39] PROBLEM - PHP7 rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:01:39] PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:01:41] (CR) Jforrester: [C: +2] [Retry] Enable Translate extension on sewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/554208 (https://phabricator.wikimedia.org/T239091) (owner: Jforrester)
[02:02:11] PROBLEM - Apache HTTP on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:02:13] PROBLEM - Nginx local proxy to apache on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:02:19] PROBLEM - Nginx local proxy to apache on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:02:23] PROBLEM - Apache HTTP on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:02:31] (Merged) jenkins-bot: [Retry] Enable Translate extension on sewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/554208 (https://phabricator.wikimedia.org/T239091) (owner: Jforrester)
[02:03:19] RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 76067 bytes in 0.425 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:03:58] James_F: Translate is a bit more complex than that
[02:04:10] why WikimediaMaintenance has a createExtensionTables.php script ;)
[02:04:11] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:04:16] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T239091 Enable Translate extension on sewikimedia, second try (duration: 01m 24s)
[02:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:04:21] T239091: Enable Extension:Translate on se.wikimedia.org - https://phabricator.wikimedia.org/T239091
[02:04:44] `mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=sewikimedia translate`
[02:04:55] RECOVERY - PHP7 rendering on mw1322 is OK: HTTP OK: HTTP/1.1 200 OK - 76069 bytes in 2.698 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:05:18] I'm guessing you missed the revtag.sql file
[02:05:28] Reedy: I did.
[02:05:33] Which is differently named.. because god knows
[02:05:37] PROBLEM - Nginx local proxy to apache on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:05:37] Fixed now.
[02:06:35] PROBLEM - Check systemd state on boron is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:23] Reedy: I've used createExtensionTables before, but had forgotten it. Oh well. :-)
[02:07:38] heh
[02:08:05] PROBLEM - PHP7 rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:08:47] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.303 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:08:49] PROBLEM - Apache HTTP on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:08:58] (PS2) Jforrester: GrowthExperiments: Remove unused config var [mediawiki-config] - https://gerrit.wikimedia.org/r/552502 (owner: Kosta Harlan)
[02:09:35] PROBLEM - Nginx local proxy to apache on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:09:39] PROBLEM - PHP7 rendering on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:09:49] RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 76093 bytes in 3.917 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:09:51] (CR) Jforrester: [C: +2] "Dropped in 1.34.0-wmf.24." [mediawiki-config] - https://gerrit.wikimedia.org/r/552502 (owner: Kosta Harlan)
[02:10:12] (PS2) Jforrester: Beta labs: Remove unused GrowthExperiments config var [mediawiki-config] - https://gerrit.wikimedia.org/r/552501 (owner: Kosta Harlan)
[02:10:18] (CR) Jforrester: [C: +2] Beta labs: Remove unused GrowthExperiments config var [mediawiki-config] - https://gerrit.wikimedia.org/r/552501 (owner: Kosta Harlan)
[02:10:29] RECOVERY - PHP7 rendering on mw1321 is OK: HTTP OK: HTTP/1.1 200 OK - 76093 bytes in 5.367 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:10:44] (Merged) jenkins-bot: GrowthExperiments: Remove unused config var [mediawiki-config] - https://gerrit.wikimedia.org/r/552502 (owner: Kosta Harlan)
[02:11:03] PROBLEM - PHP7 rendering on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:11:11] (Merged) jenkins-bot: Beta labs: Remove unused GrowthExperiments config var [mediawiki-config] - https://gerrit.wikimedia.org/r/552501 (owner: Kosta Harlan)
[02:11:13] RECOVERY - Nginx local proxy to apache on mw1322 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:11:17] (CR) Jforrester: [C: -1] "Not until wmf.8 is everywhere, per Roan." [mediawiki-config] - https://gerrit.wikimedia.org/r/553222 (https://phabricator.wikimedia.org/T232396) (owner: Catrope)
[02:11:17] RECOVERY - Apache HTTP on mw1321 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.585 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:11:17] RECOVERY - Nginx local proxy to apache on mw1321 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 2.895 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:11:19] RECOVERY - PHP7 rendering on mw1322 is OK: HTTP OK: HTTP/1.1 200 OK - 76091 bytes in 0.170 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:11:51] (PS1) Dzahn: ssl: add releases-jenkins to releases puppet TLS cert [puppet] - https://gerrit.wikimedia.org/r/554209 (https://phabricator.wikimedia.org/T239629)
[02:12:29] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:12:57] PROBLEM - Check the last execution of package_builder_Clean_up_build_directory on boron is CRITICAL: CRITICAL: Status of the systemd unit package_builder_Clean_up_build_directory https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:13:04] (PS3) Jforrester: Remove wgTorLoadNodes as it was removed in b5ccbee in 1.340-wmf.15+ [mediawiki-config] - https://gerrit.wikimedia.org/r/550055 (owner: Reedy)
[02:13:06] (CR) Dzahn: [C: +2] "openssl x509 -text -noout -in releases.discovery.wmnet.crt | grep DNS" [puppet] - https://gerrit.wikimedia.org/r/554209 (https://phabricator.wikimedia.org/T239629) (owner: Dzahn)
[02:13:10] (CR) Jforrester: [C: +2] Remove wgTorLoadNodes as it was removed in b5ccbee in 1.340-wmf.15+ [mediawiki-config] - https://gerrit.wikimedia.org/r/550055 (owner: Reedy)
[02:13:16] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wgGEHelpPanelSearchEnabled, no longer used (duration: 01m 08s)
[02:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:13:44] (CR) Jforrester: "(Needs wmf.8.)" [mediawiki-config] - https://gerrit.wikimedia.org/r/549936 (https://phabricator.wikimedia.org/T161553) (owner: Andrew Bogott)
[02:14:09] (Merged) jenkins-bot: Remove wgTorLoadNodes as it was removed in b5ccbee in 1.340-wmf.15+ [mediawiki-config] - https://gerrit.wikimedia.org/r/550055 (owner: Reedy)
[02:14:09] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.194 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:14:26] (CR) Jforrester: [C: -1] "Waiting on redirect by SRE." [mediawiki-config] - https://gerrit.wikimedia.org/r/552549 (https://phabricator.wikimedia.org/T238803) (owner: Jforrester)
[02:14:47] RECOVERY - Nginx local proxy to apache on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.327 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:15:23] PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:15:28] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, Traffic, Patch-For-Review: https://releases-jenkins.wikimedia.org yields a 502 unrecheable - https://phabricator.wikimedia.org/T239629 (Dzahn) @hashar It's fixed for me now. It was missing the releases-jenkins.w...
[02:15:48] (PS10) Jforrester: Rename DPL extension variable to non-ambiguous name, part 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/548569 (https://phabricator.wikimedia.org/T237698) (owner: Ammarpad)
[02:15:53] PROBLEM - Apache HTTP on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:15:54] (CR) Jforrester: [C: +2] Rename DPL extension variable to non-ambiguous name, part 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/548569 (https://phabricator.wikimedia.org/T237698) (owner: Ammarpad)
[02:15:59] PROBLEM - Nginx local proxy to apache on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:15:59] PROBLEM - Nginx local proxy to apache on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:16:07] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, Traffic, Patch-For-Review: https://releases-jenkins.wikimedia.org yields a 502 unrecheable - https://phabricator.wikimedia.org/T239629 (Dzahn) Open→Resolved
[02:16:16] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, Traffic: https://releases-jenkins.wikimedia.org yields a 502 unrecheable - https://phabricator.wikimedia.org/T239629 (Dzahn)
[02:16:27] RECOVERY - Nginx local proxy to apache on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.467 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:16:45] (Merged) jenkins-bot: Rename DPL extension variable to non-ambiguous name, part 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/548569 (https://phabricator.wikimedia.org/T237698) (owner: Ammarpad)
[02:16:48] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Stop setting wgTorLoadNodes, not read for a while (duration: 01m 14s)
[02:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:54] (PS4) Jforrester: Rename DPL extension variable to non-ambiguous name, part 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/549666 (https://phabricator.wikimedia.org/T237698) (owner: Ammarpad)
[02:16:59] (CR) Jforrester: [C: +2] Rename DPL extension variable to non-ambiguous name, part 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/549666 (https://phabricator.wikimedia.org/T237698) (owner: Ammarpad)
[02:17:06] (PS4) Jforrester: Rename DPL extension variable to non-ambiguous name, part 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/549697 (https://phabricator.wikimedia.org/T237698) (owner: Ammarpad)
[02:17:37] RECOVERY - Nginx local proxy to apache on mw1322 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:17:47] RECOVERY - Nginx local proxy to apache on mw1321 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.465 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:17:47] (Merged) jenkins-bot: Rename DPL extension variable to non-ambiguous name, part 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/549666 (https://phabricator.wikimedia.org/T237698) (owner: Ammarpad)
[02:18:07] PROBLEM - PHP7 rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:18:13] (CR) Jforrester: [C: +2] Rename DPL extension variable to non-ambiguous name, part 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/549697 (https://phabricator.wikimedia.org/T237698) (owner: Ammarpad)
[02:18:47] PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:18:53] PROBLEM - Apache HTTP on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:19:02] (Merged) jenkins-bot: Rename DPL extension variable to non-ambiguous name, part 3 [mediawiki-config] - https://gerrit.wikimedia.org/r/549697 (https://phabricator.wikimedia.org/T237698) (owner: Ammarpad)
[02:19:11] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T237698 Set wmgUseDynamicPageList, less cryptic form of wmgUseDPL (duration: 01m 16s)
[02:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:19:16] T237698: Rename DynamicPageList config variable name to non-ambiguous - https://phabricator.wikimedia.org/T237698
[02:19:18] (PS5) Jforrester: Switch to wmf specific run mode for $wgDisableQueryPageUpdate [mediawiki-config] - https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: Umherirrender)
[02:19:31] (PS6) Jforrester: Switch to wmf specific run mode for $wgDisableQueryPageUpdate [mediawiki-config] - https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: Umherirrender)
[02:21:13] (PS2) Jforrester: Remove testwiki => true from wmgUseCentralAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/544193 (owner: Reedy)
[02:21:15] (CR) Jforrester: [C: +2] Remove testwiki => true from wmgUseCentralAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/544193 (owner: Reedy)
[02:21:37] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T237698 Read wmgUseDynamicPageList not wmgUseDPL (duration: 01m 22s)
[02:21:43] RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 76093 bytes in 7.323 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:09] (Merged) jenkins-bot: Remove testwiki => true from wmgUseCentralAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/544193 (owner: Reedy)
[02:22:25] PROBLEM - Nginx local proxy to apache on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:22:26] (CR) Jforrester: [C: +2] Switch to wmf specific run mode for $wgDisableQueryPageUpdate [mediawiki-config] - https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: Umherirrender)
[02:22:55] PROBLEM - Nginx local proxy to apache on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:23:18] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T237698 Stop setting wmgUseDPL, unread (duration: 01m 11s)
[02:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:43] RECOVERY - PHP7 rendering on mw1320 is OK: HTTP OK: HTTP/1.1 200 OK - 76093 bytes in 8.901 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:23:55] (PS7) Jforrester: Switch to wmf specific run mode for $wgDisableQueryPageUpdate [mediawiki-config] - https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: Umherirrender)
[02:24:13] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.919 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:24:23] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 2.935 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:25:25] (CR) Jforrester: [C: +1] "Ping. :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/526509 (https://phabricator.wikimedia.org/T222240) (owner: Thcipriani)
[02:25:38] (CR) Jforrester: [C: +2] "…" [mediawiki-config] - https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: Umherirrender)
[02:25:50] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting testwiki => true for wmgUseCentralAuth, already implied by default (duration: 01m 24s)
[02:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:37] (Merged) jenkins-bot: Switch to wmf specific run mode for $wgDisableQueryPageUpdate [mediawiki-config] - https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: Umherirrender)
[02:26:47] PROBLEM - Nginx local proxy to apache on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:27:55] PROBLEM - PHP7 rendering on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:28:09] PROBLEM - PHP7 rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:28:19] PROBLEM - PHP7 rendering on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:28:25] RECOVERY - Apache HTTP on mw1321 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.875 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:28:57] PROBLEM - Apache HTTP on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:29:05] PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:29:41] RECOVERY - PHP7 rendering on mw1322 is OK: HTTP OK: HTTP/1.1 200 OK - 76127 bytes in 5.593 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:30:07] RECOVERY - Nginx local proxy to apache on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.224 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:31:39] RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 76125 bytes in 0.597 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:32:33] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.318 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:32:43] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T78711 Display 'twice a month' or 'once a month' on cached reports (duration: 01m 19s)
[02:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:49] T78711: querypage-no-updates still shown on special pages on wmf wikis that update from cron - https://phabricator.wikimedia.org/T78711
[02:32:55] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:33:13] RECOVERY - Nginx local proxy to apache on mw1321 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.694 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:34:17] (PS1) Dzahn: airflow: add a local mariadb server [puppet] - https://gerrit.wikimedia.org/r/554215 (https://phabricator.wikimedia.org/T236180)
[02:34:47] PROBLEM - Nginx local proxy to apache on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:34:59]
RECOVERY - PHP7 rendering on mw1321 is OK: HTTP OK: HTTP/1.1 200 OK - 76125 bytes in 0.622 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:07] (03PS3) 10BBlack: Depool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/551252 [02:35:45] RECOVERY - Nginx local proxy to apache on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 2.819 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:38:13] PROBLEM - PHP7 rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:38:21] RECOVERY - Nginx local proxy to apache on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 4.257 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:39:01] PROBLEM - Apache HTTP on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:39:03] RECOVERY - PHP7 rendering on mw1320 is OK: HTTP OK: HTTP/1.1 200 OK - 76125 bytes in 0.515 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:39:44] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10Dzahn) [02:39:45] PROBLEM - Nginx local proxy to apache on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:39:45] PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:39:55] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.744 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:40:29] (03PS2) 10Dzahn: airflow: add a local mariadb server [puppet] - 
10https://gerrit.wikimedia.org/r/554215 (https://phabricator.wikimedia.org/T236180) [02:42:23] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.816 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:42:23] RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 76127 bytes in 5.970 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:42:57] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [02:44:53] PROBLEM - Nginx local proxy to apache on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:45:39] PROBLEM - PHP7 rendering on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:45:53] PROBLEM - Apache HTTP on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:46:25] PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:46:31] RECOVERY - Nginx local proxy to apache on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:47:03] PROBLEM - Nginx local proxy to apache on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:47:25] RECOVERY - PHP7 rendering on mw1320 is OK: HTTP OK: HTTP/1.1 200 OK - 76127 bytes in 6.934 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:48:03] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:48:43] RECOVERY - Nginx local proxy to apache on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:48:57] !log mw1320, mw1321 restarted php-fpm [02:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:19] RECOVERY - Apache HTTP on mw1321 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:49:31] RECOVERY - Nginx local proxy to apache on mw1321 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:49:31] RECOVERY - PHP7 rendering on mw1321 is OK: HTTP OK: HTTP/1.1 200 OK - 76092 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:49:48] (03PS3) 10Jforrester: Add growthexperiments dblist, for puppet usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) (owner: 10Gergő Tisza) [02:50:40] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546894 (https://phabricator.wikimedia.org/T208369) (owner: 10Gergő Tisza) [02:50:52] 10Operations, 10ops-codfw, 10ops-eqiad, 10Traffic: Add 10G NICs to core site DNS servers (6 servers, 3 per site) - https://phabricator.wikimedia.org/T239675 (10BBlack) [02:50:58] (03CR) 10Jforrester: [C: 04-1] "See parent." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/546895 (owner: 10Gergő Tisza) [02:51:15] PROBLEM - Nginx local proxy to apache on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:52:55] RECOVERY - Nginx local proxy to apache on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.972 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:53:27] PROBLEM - Apache HTTP on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:53:27] PROBLEM - PHP7 rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:05] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:05] RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 76091 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:41] !log mw1269 restarted nginx, php [02:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:31] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.06667 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [02:57:47] PROBLEM - Apache HTTP on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:11] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:01:15] RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:05:50] (03PS1) 10Dzahn: phabricator: remove aphlict from webserver config [puppet] - 10https://gerrit.wikimedia.org/r/554219 [03:10:27] (03PS2) 10Dzahn: phabricator: remove aphlict from webserver config [puppet] - 10https://gerrit.wikimedia.org/r/554219 (https://phabricator.wikimedia.org/T238593) [03:11:46] off [03:38:27] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [03:40:10] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [03:47:38] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@d00c6ad]: Fix: Apply language headers to zhwiki mobile-html responses (T239659) [03:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:45] T239659: mobile-html: Accept-Language cannot be applied properly - https://phabricator.wikimedia.org/T239659 [03:53:30] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@d00c6ad]: Fix: Apply language headers to zhwiki mobile-html responses (T239659) (duration: 05m 51s) [03:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:53:35] T239659: mobile-html: 
Accept-Language cannot be applied properly - https://phabricator.wikimedia.org/T239659 [04:08:45] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:14:36] 10Operations, 10Wikimedia Design Style Guide: Temporarily forward /design-style-guide/components/ to /design-style-guide/components/links.html - https://phabricator.wikimedia.org/T239681 (10Volker_E) [04:15:38] !log tstarling@deploy1001 Synchronized php-1.35.0-wmf.8/includes/Rest/EntryPoint.php: disable IE6 safety checks for T239666 (duration: 01m 01s) [04:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:44] T239666: RESTBase requests to Parsoid/PHP that contain a "." in the title (without a / component) fail with a http 403 - https://phabricator.wikimedia.org/T239666 [04:19:10] !log tstarling@deploy1001 Synchronized php-1.35.0-wmf.5/includes/Rest/EntryPoint.php: disable IE6 safety checks for T239666 (duration: 01m 00s) [04:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:26:01] (03PS1) 10VolkerE: Redirect temporarily to /components/links.html when accessing /components/ [puppet] - 10https://gerrit.wikimedia.org/r/554231 (https://phabricator.wikimedia.org/T239681) [05:17:29] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [05:30:05] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0625 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [05:30:55] (03PS7) 
10Andrew Bogott: Neutron l3_agent: remove external_network_bridge config option [puppet] - 10https://gerrit.wikimedia.org/r/553883 [05:30:57] (03PS1) 10Andrew Bogott: keystone wmtotp: update to conform with new ocata parent class [puppet] - 10https://gerrit.wikimedia.org/r/554232 [05:30:57] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [05:31:48] (03CR) 10jerkins-bot: [V: 04-1] keystone wmtotp: update to conform with new ocata parent class [puppet] - 10https://gerrit.wikimedia.org/r/554232 (owner: 10Andrew Bogott) [05:34:27] (03PS2) 10Andrew Bogott: keystone wmtotp: update to conform with new ocata parent class [puppet] - 10https://gerrit.wikimedia.org/r/554232 (https://phabricator.wikimedia.org/T237749) [05:34:29] (03PS8) 10Andrew Bogott: Neutron l3_agent: remove external_network_bridge config option [puppet] - 10https://gerrit.wikimedia.org/r/553883 [05:35:47] (03CR) 10Andrew Bogott: [C: 03+2] keystone wmtotp: update to conform with new ocata parent class [puppet] - 10https://gerrit.wikimedia.org/r/554232 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [05:43:25] (03PS9) 10Andrew Bogott: Neutron l3_agent: remove external_network_bridge config option [puppet] - 10https://gerrit.wikimedia.org/r/553883 [05:43:27] (03PS1) 10Andrew Bogott: Keystone wmtotp: remove unused arg [puppet] - 10https://gerrit.wikimedia.org/r/554233 (https://phabricator.wikimedia.org/T237749) [05:44:49] (03PS2) 10Ema: ATS: log Cache-Control and Set-Cookie responses from origin [puppet] - 10https://gerrit.wikimedia.org/r/552853 (https://phabricator.wikimedia.org/T238494) [05:45:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P9798 and previous config saved to /var/cache/conftool/dbconfig/20191203-054528-marostegui.json [05:45:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:22] !log Remove ar_comment triggers from s3 db1124:3313 - T234704 [05:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:28] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704 [05:49:29] (03CR) 10Andrew Bogott: [C: 03+2] Keystone wmtotp: remove unused arg [puppet] - 10https://gerrit.wikimedia.org/r/554233 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [05:50:03] !log cp3050: ats-be restart with proxy.config.http.server_session_sharing.pool=thread T238494 [05:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:08] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [05:52:15] (03PS1) 10Marostegui: site.pp: Remove puppet references for db2067 [puppet] - 10https://gerrit.wikimedia.org/r/554234 (https://phabricator.wikimedia.org/T233185) [05:53:13] (03PS1) 10Marostegui: wmnet: Remove production dns entries for db2067 [dns] - 10https://gerrit.wikimedia.org/r/554235 (https://phabricator.wikimedia.org/T233185) [05:53:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [05:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:29] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6958 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [05:53:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [05:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:51] (03CR) 10Ema: [C: 03+2] ATS: log Cache-Control and Set-Cookie responses from origin [puppet] - 10https://gerrit.wikimedia.org/r/552853 
(https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [05:54:10] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove puppet references for db2067 [puppet] - 10https://gerrit.wikimedia.org/r/554234 (https://phabricator.wikimedia.org/T233185) (owner: 10Marostegui) [05:55:03] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production dns entries for db2067 [dns] - 10https://gerrit.wikimedia.org/r/554235 (https://phabricator.wikimedia.org/T233185) (owner: 10Marostegui) [05:56:13] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2067.codfw.wmnet - https://phabricator.wikimedia.org/T233185 (10Marostegui) a:05Marostegui→03Papaul [05:56:29] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2067.codfw.wmnet - https://phabricator.wikimedia.org/T233185 (10Marostegui) host ready for @Papaul to take over the last steps [05:56:39] 10Operations, 10DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [05:56:50] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [05:57:23] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) 05Open→03Resolved All these hosts have been sent for decommissioning. Going to close this for now. 
[06:00:14] (03PS1) 10Marostegui: mariadb: Set db2065 to spare [puppet] - 10https://gerrit.wikimedia.org/r/554236 (https://phabricator.wikimedia.org/T239046) [06:00:16] !log Remove db2065 from tendril and zarcillo T239046 [06:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:21] T239046: decommission db2065.codfw.wmnet - https://phabricator.wikimedia.org/T239046 [06:01:30] (03CR) 10Marostegui: [C: 03+2] mariadb: Set db2065 to spare [puppet] - 10https://gerrit.wikimedia.org/r/554236 (https://phabricator.wikimedia.org/T239046) (owner: 10Marostegui) [06:04:17] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0125 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [06:07:27] !log Stop MySQL on db1062 for decommissioning T239188 [06:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:32] T239188: Decommission db1062.eqiad.wmnet - https://phabricator.wikimedia.org/T239188 [06:07:51] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.525 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [06:12:36] 10Operations, 10Maps: OSM Replication failed at eqiad and codfw - https://phabricator.wikimedia.org/T237228 (10Arjunaraoc) @MSantos , From the linked sub task, I find that the sub task has changed to "to-do". Is there a possibility to reset/reboot the current configuration, so that whatever works (for examp... 
[06:15:01] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5333 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [06:18:18] (03PS1) 10Andrew Bogott: keystone: add missing policy rule [puppet] - 10https://gerrit.wikimedia.org/r/554237 [06:19:27] !log volker-e@deploy1001 Started deploy [design/style-guide@8e08740]: Deploy design/style-guide: [06:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:36] !log volker-e@deploy1001 Finished deploy [design/style-guide@8e08740]: Deploy design/style-guide: (duration: 00m 08s) [06:19:38] (03CR) 10Andrew Bogott: [C: 03+2] keystone: add missing policy rule [puppet] - 10https://gerrit.wikimedia.org/r/554237 (owner: 10Andrew Bogott) [06:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:41] 10Operations, 10DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [06:29:44] !log Deploy schema change on db1112 with replication (this will generate lag on s3 on labs) [06:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:14] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [06:34:26] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.08333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [06:49:42] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:51:20] PROBLEM - Router interfaces on cr3-ulsfo is 
CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:56:41] (03PS1) 10Andrew Bogott: keystonehooks: remove a file that never should have been here [puppet] - 10https://gerrit.wikimedia.org/r/554238 [06:56:43] (03PS1) 10Andrew Bogott: wmfkeystonehooks: reconcile some differences with upstream keystone [puppet] - 10https://gerrit.wikimedia.org/r/554239 (https://phabricator.wikimedia.org/T237749) [06:57:38] (03CR) 10Andrew Bogott: [C: 03+2] keystonehooks: remove a file that never should have been here [puppet] - 10https://gerrit.wikimedia.org/r/554238 (owner: 10Andrew Bogott) [06:57:44] (03CR) 10jerkins-bot: [V: 04-1] wmfkeystonehooks: reconcile some differences with upstream keystone [puppet] - 10https://gerrit.wikimedia.org/r/554239 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [06:59:10] (03PS2) 10Andrew Bogott: wmfkeystonehooks: reconcile some differences with upstream keystone [puppet] - 10https://gerrit.wikimedia.org/r/554239 (https://phabricator.wikimedia.org/T237749) [07:00:19] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: reconcile some differences with upstream keystone [puppet] - 10https://gerrit.wikimedia.org/r/554239 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [07:10:12] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7542 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [07:10:44] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:22:36] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [07:39:06] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [07:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:07] (03PS5) 10Alexandros Kosiaris: prometheus: Scrape kube-proxy metrics [puppet] - 10https://gerrit.wikimedia.org/r/554038 [07:43:09] (03PS4) 10Alexandros Kosiaris: prometheus: Allow keeping FQDNs as targets [puppet] - 10https://gerrit.wikimedia.org/r/554059 [07:43:11] (03PS4) 10Alexandros Kosiaris: calico: Keep FQDNs for calico felix prometheus targets [puppet] - 10https://gerrit.wikimedia.org/r/554060 [07:43:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/554038 (owner: 10Alexandros Kosiaris) [07:53:20] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Traffic: https://releases-jenkins.wikimedia.org yields a 502 unrecheable - https://phabricator.wikimedia.org/T239629 (10hashar) Magic! Danke Schon! [07:57:37] (03PS1) 10Marostegui: mariadb: Add wmf-mariadb104 as a possibility [puppet] - 10https://gerrit.wikimedia.org/r/554240 (https://phabricator.wikimedia.org/T193224) [08:06:55] 10Operations, 10Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10elukey) Hello! 
I took the liberty to ack a lot of criticals/unknowns in icinga that were related to these new hosts, IIUC these are not in production :) [08:14:22] !log volker-e@deploy1001 Started deploy [design/style-guide@7978f0d]: Deploy design/style-guide: [08:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:29] !log volker-e@deploy1001 Finished deploy [design/style-guide@7978f0d]: Deploy design/style-guide: (duration: 00m 06s) [08:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:25] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [08:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:34] (03PS1) 10Andrew Bogott: keystone ocata: include custom wsgi script [puppet] - 10https://gerrit.wikimedia.org/r/554241 (https://phabricator.wikimedia.org/T237749) [08:19:08] !log apply calico rules for eventgate-logging-external. T236386 [08:19:08] (03CR) 10jerkins-bot: [V: 04-1] keystone ocata: include custom wsgi script [puppet] - 10https://gerrit.wikimedia.org/r/554241 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [08:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:13] T236386: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 [08:20:05] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[08:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:15] (03PS2) 10Andrew Bogott: keystone ocata: include custom wsgi script [puppet] - 10https://gerrit.wikimedia.org/r/554241 (https://phabricator.wikimedia.org/T237749) [08:22:56] (03PS1) 10Muehlenhoff: Extend MOU for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/554242 [08:23:13] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10akosiaris) >>! In T236386#5706868, @Ottomata wrote: > @akosiaris I merged and applied https://gerrit.wikimedia.org/r/c/opera... [08:23:55] (03CR) 10Andrew Bogott: [C: 03+2] keystone ocata: include custom wsgi script [puppet] - 10https://gerrit.wikimedia.org/r/554241 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [08:25:46] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db1062 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554243 (https://phabricator.wikimedia.org/T239188) [08:26:51] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db1062 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554243 (https://phabricator.wikimedia.org/T239188) (owner: 10Marostegui) [08:27:43] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db1062 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554243 (https://phabricator.wikimedia.org/T239188) (owner: 10Marostegui) [08:29:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1062 from config T239188 (duration: 01m 08s) [08:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:21] T239188: Decommission db1062.eqiad.wmnet - https://phabricator.wikimedia.org/T239188 [08:29:25] (03PS1) 10Joal: Upgrade mw-history-reduced snapshot in aqs conf [puppet] - 10https://gerrit.wikimedia.org/r/554244 [08:30:25] !log marostegui@deploy1001 
Synchronized wmf-config/db-codfw.php: Remove db1062 from config T239188 (duration: 01m 02s) [08:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:48] (03CR) 10Muehlenhoff: [C: 03+2] Extend MOU for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/554242 (owner: 10Muehlenhoff) [08:33:43] (03CR) 10Elukey: [C: 03+2] Upgrade mw-history-reduced snapshot in aqs conf [puppet] - 10https://gerrit.wikimedia.org/r/554244 (owner: 10Joal) [08:34:42] thanks elukey --^ [08:35:56] !log cp3050: set cache.max_open_read_retries=-1 and proxy.config.http.cache.max_open_write_retries=1 (default values) T238494 [08:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:01] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [08:44:45] (03PS2) 10Alexandros Kosiaris: RBAC: Unify rules into 1 file [deployment-charts] - 10https://gerrit.wikimedia.org/r/551251 [08:44:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] RBAC: Unify rules into 1 file [deployment-charts] - 10https://gerrit.wikimedia.org/r/551251 (owner: 10Alexandros Kosiaris) [08:45:04] (03Merged) 10jenkins-bot: RBAC: Unify rules into 1 file [deployment-charts] - 10https://gerrit.wikimedia.org/r/551251 (owner: 10Alexandros Kosiaris) [08:45:20] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [08:45:21] !log elukey@cumin1001 END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) [08:45:21] !log Restart php-fpm on mw[1330-1333].eqiad.wmnet [08:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:29] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [08:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:55] !log elukey@cumin1001 END 
(PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [08:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:12] PROBLEM - Prometheus prometheus1003/k8s restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [08:51:43] (03CR) 10Jcrespo: [C: 03+1] "Should we remove mysql 56 and 57 and add percona-server?" [puppet] - 10https://gerrit.wikimedia.org/r/554240 (https://phabricator.wikimedia.org/T193224) (owner: 10Marostegui) [08:56:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] "noop in staging, eqiad, codfw, \o/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/551251 (owner: 10Alexandros Kosiaris) [08:56:59] (03PS2) 10Alexandros Kosiaris: RBAC: Allow prometheus access to nodes resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/551266 (https://phabricator.wikimedia.org/T238410) [08:57:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] RBAC: Allow prometheus access to nodes resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/551266 (https://phabricator.wikimedia.org/T238410) (owner: 10Alexandros Kosiaris) [08:58:02] (03Merged) 10jenkins-bot: RBAC: Allow prometheus access to nodes resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/551266 (https://phabricator.wikimedia.org/T238410) (owner: 10Alexandros Kosiaris) [08:58:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:59:45] (03PS2) 10Marostegui: mariadb: Add wmf-mariadb104 as a possibility [puppet] - 10https://gerrit.wikimedia.org/r/554240 (https://phabricator.wikimedia.org/T193224) [09:00:10] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [09:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:40] (03PS3) 10Marostegui: mariadb: Add wmf-mariadb104,percona 8.0 as a possibility [puppet] - 10https://gerrit.wikimedia.org/r/554240 (https://phabricator.wikimedia.org/T193224) [09:02:17] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [09:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:59] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . 
[09:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:03] (03CR) 10Jcrespo: [C: 03+1] mariadb: Add wmf-mariadb104,percona 8.0 as a possibility [puppet] - 10https://gerrit.wikimedia.org/r/554240 (https://phabricator.wikimedia.org/T193224) (owner: 10Marostegui) [09:03:27] (03CR) 10Jcrespo: [C: 03+1] mariadb: Add wmf-mariadb104,percona 8.0 as a possibility (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554240 (https://phabricator.wikimedia.org/T193224) (owner: 10Marostegui) [09:04:13] (03CR) 10Marostegui: [C: 03+2] mariadb: Add wmf-mariadb104,percona 8.0 as a possibility [puppet] - 10https://gerrit.wikimedia.org/r/554240 (https://phabricator.wikimedia.org/T193224) (owner: 10Marostegui) [09:05:32] !log downtime new logstash hosts in codfw/eqiad until thurs [09:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:33] 10Operations, 10Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10fgiunchedi) >>! In T234854#5708171, @elukey wrote: > Hello! I took the liberty to ack a lot of criticals/unknowns in icinga that were related to these new hosts, IIUC these are not in production :) Thanks! I'v... 
[09:08:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:11:35] (03CR) 10Jbond: "lgtm just missed one" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [09:11:41] (03PS1) 10Marostegui: service.pp: Change vendor to: percona-server [puppet] - 10https://gerrit.wikimedia.org/r/554252 [09:12:13] (03CR) 10Jcrespo: [C: 03+1] service.pp: Change vendor to: percona-server [puppet] - 10https://gerrit.wikimedia.org/r/554252 (owner: 10Marostegui) [09:12:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:12:37] (03CR) 10Marostegui: [C: 03+2] service.pp: Change vendor to: percona-server [puppet] - 10https://gerrit.wikimedia.org/r/554252 (owner: 10Marostegui) [09:13:24] RECOVERY - Prometheus prometheus1003/k8s restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [09:14:48] (03CR) 10Filippo Giunchedi: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/554115 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [09:17:21] (03CR) 10Jbond: [C: 04-1] "lgtm one copy/paste error and a nitpick" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/554215 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [09:18:10] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.9917 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [09:21:26] (03CR) 10Filippo Giunchedi: [C: 03+2] codfw: remove old wezen record [dns] - 10https://gerrit.wikimedia.org/r/554083 (https://phabricator.wikimedia.org/T224564) (owner: 10Volans) [09:21:30] (03PS2) 10Filippo Giunchedi: codfw: remove old wezen record [dns] - 10https://gerrit.wikimedia.org/r/554083 (https://phabricator.wikimedia.org/T224564) (owner: 10Volans) [09:21:32] (03PS3) 10Jbond: add dcausse user back [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [09:21:49] (03CR) 10jerkins-bot: [V: 04-1] add dcausse user back [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [09:21:52] (03CR) 10Hashar: "> Thanks! This is my first package release so I'm definitely feeling around in the dark a little." 
[debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/553736 (owner: 10Hashar) [09:22:10] !log Roll restart php-fpm mw[1240-1258,1261-1275,1319-1333].eqiad.wmnet [09:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:25] (03CR) 10Jbond: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [09:24:35] (03CR) 10Jbond: [C: 04-1] add dcausse user back [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [09:27:46] (03CR) 10Filippo Giunchedi: "See inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond) [09:35:47] (03PS3) 10Hashar: Update debian/changelog to point to Buster [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/553736 [09:36:28] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [09:39:32] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5292 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [09:41:04] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:41:12] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:42:03] (03PS3) 10Jcrespo: check_mariadb.py: Update bacula logic to the latest Bacula class [puppet] - 10https://gerrit.wikimedia.org/r/552860 [09:42:47] (03CR) 10jerkins-bot: [V: 04-1] check_mariadb.py: Update bacula logic to the latest Bacula class [puppet] - 10https://gerrit.wikimedia.org/r/552860 (owner: 10Jcrespo) [09:44:01] (03CR) 10Hashar: "The package builds fine with `buster` (no need to use `buster-wikimedia` apparently). 
The only reason the CI job is due to lintian:" [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/553736 (owner: 10Hashar) [09:44:16] (03PS2) 10Volans: codfw: add missing mgmt records [dns] - 10https://gerrit.wikimedia.org/r/554082 (https://phabricator.wikimedia.org/T239597) [09:45:35] (03CR) 10Volans: [C: 03+2] codfw: add missing mgmt records [dns] - 10https://gerrit.wikimedia.org/r/554082 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [09:49:01] (03CR) 10Cparle: [C: 04-2] Turn off redirect on exact search match for commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553510 (https://phabricator.wikimedia.org/T235263) (owner: 10Cparle) [09:52:04] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [09:52:08] (03CR) 10Muehlenhoff: "Using buster-wikimedia as the target in debian/changelog fine here (as it's currently an internal package), maybe Kunal will also package " [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/553736 (owner: 10Hashar) [09:56:43] (03PS1) 10Ema: cache: reimage cp1083 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/554256 (https://phabricator.wikimedia.org/T227432) [09:57:56] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:59:42] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:00:23] (03PS7) 10Jbond: profile::idp: add jmx prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) [10:01:01] (03CR) 10Jbond: "Thanks, See inline, happy to update the hostname variable" (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond) [10:01:21] !log depool cp1083 and reimage as text_ats T227432 [10:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:27] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [10:01:38] (03CR) 10Ema: [C: 03+2] cache: reimage cp1083 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/554256 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [10:03:56] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1083.eqiad.wmnet'] ` The log can be found in `/var/log/wm... [10:09:03] (03PS1) 10Jcrespo: bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) [10:10:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond) [10:10:56] (03PS2) 10Jcrespo: bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) [10:10:58] (03CR) 10jerkins-bot: [V: 04-1] bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [10:12:12] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance=cp2023:9536 site=codfw tunnel={cp1083_v4,cp1083_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:12:25] 10Operations, 10observability, 10User-fgiunchedi: Reimage wezen 
to Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10MoritzMuehlenhoff) [10:12:48] (03PS3) 10Jcrespo: bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) [10:12:50] (03CR) 10jerkins-bot: [V: 04-1] bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [10:13:46] PROBLEM - Aggregate IPsec Tunnel Status esams on icinga1001 is CRITICAL: instance=cp3064:9536 site=esams tunnel={cp1083_v4,cp1083_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:14:36] (03CR) 10jerkins-bot: [V: 04-1] bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [10:14:41] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::idp: add jmx prometheus exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond) [10:16:51] 10Operations, 10Puppet, 10User-jbond: facter3: use structured facts - https://phabricator.wikimedia.org/T222160 (10jbond) A quick and dirty count of top level global facts ` git grep -ho '$::\w\+\s' | sort -u ` [10:17:57] (03CR) 10Jbond: [C: 03+2] profile::idp: add jmx prometheus exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond) [10:18:13] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [10:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:34] (03PS4) 10Jcrespo: bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 
(https://phabricator.wikimedia.org/T238048) [10:19:47] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.975 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [10:20:22] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:20:24] (03CR) 10ArielGlenn: [C: 03+1] "This seems fine to me (and noncontroversial), is it waiting on anything? Here's my vote just in case." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552565 (https://phabricator.wikimedia.org/T238921) (owner: 10Daniel Kinzler) [10:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:29] (03CR) 10jerkins-bot: [V: 04-1] bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [10:22:30] (03PS1) 10Jbond: profile::idp: update file source to content [puppet] - 10https://gerrit.wikimedia.org/r/554259 [10:24:53] (03CR) 10Jbond: [C: 03+2] profile::idp: update file source to content [puppet] - 10https://gerrit.wikimedia.org/r/554259 (owner: 10Jbond) [10:25:07] (03CR) 10Daniel Kinzler: "> Patch Set 2: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552565 (https://phabricator.wikimedia.org/T238921) (owner: 10Daniel Kinzler) [10:26:31] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:27:19] RECOVERY - Aggregate IPsec Tunnel Status esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:28:22] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) >>! In T238494#5702143, @Krinkle wrote: > I do have a gut-feeling, though, that these two example you mention cannot (should... [10:29:11] (03PS5) 10Jcrespo: bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) [10:31:02] (03CR) 10jerkins-bot: [V: 04-1] bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [10:31:25] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1083.eqiad.wmnet'] ` and were **ALL** successful. [10:33:46] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [10:35:50] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [10:36:48] (03CR) 10Jbond: [C: 03+2] backup::host: refactor [puppet] - 10https://gerrit.wikimedia.org/r/547568 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [10:36:59] (03PS4) 10Jbond: backup::host: refactor [puppet] - 10https://gerrit.wikimedia.org/r/547568 (https://phabricator.wikimedia.org/T221083) [10:37:09] (03PS5) 10Jbond: backup::host: use fqdn_rand_string for password generation [puppet] - 10https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) [10:37:32] !log pool cp1083 with ATS backend T227432 [10:37:34] (03PS6) 10Jcrespo: bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) [10:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:39] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [10:39:01] (03CR) 10jerkins-bot: [V: 04-1] backup::host: use fqdn_rand_string for password generation [puppet] - 10https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [10:39:24] (03CR) 10jerkins-bot: [V: 04-1] bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [10:40:34] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [10:41:09] (03PS7) 10Jcrespo: bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) [10:42:58] (03CR) 10jerkins-bot: [V: 04-1] bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [10:46:08] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [10:55:08] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.075 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [10:59:44] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [11:00:23] !log Updated operations-puppet-tests-stretch-docker CI job to use tox 3.10.0 and support various python 3 versions [11:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:31] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Traffic: https://releases-jenkins.wikimedia.org yields a 502 unreachable - https://phabricator.wikimedia.org/T239629 (10Aklapper) [11:26:44] (03PS1) 10Alexandros Kosiaris: varnish: Add Go-http-client/2.0 to bot_blocked_nets 
[puppet] - 10https://gerrit.wikimedia.org/r/554269 [11:30:45] (03CR) 10Effie Mouzeli: [C: 03+1] varnish: Add Go-http-client/2.0 to bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/554269 (owner: 10Alexandros Kosiaris) [11:31:19] !log refresh kibana fields for logstash-* [11:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:16] (03CR) 10Hashar: "recheck (fixed pip cache)" [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [11:36:53] !log Updated operations-puppet-tests-stretch-docker to fix pip cache directory [11:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:14] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale-full only: 4 (archiva1001, ...), Fresh: 95 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [11:40:34] (03CR) 10Hashar: CI - taskgen: add black tests for python2 and python3 files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [11:41:55] (03PS2) 10Alexandros Kosiaris: varnish: Add Go-http-client/2.0 to bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/554269 [11:45:54] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:47:36] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [11:52:37] (03CR) 10Ema: [C: 03+1] "+1 if tests are green :)" [puppet] - 
10https://gerrit.wikimedia.org/r/554269 (owner: 10Alexandros Kosiaris) [11:58:02] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b346ebf]: Mirror html2html traffic to Parsoid/PHP - T229015 T239643 [11:58:02] !log mobrovac@deploy1001 deploy aborted: Mirror html2html traffic to Parsoid/PHP - T229015 T239643 (duration: 00m 00s) [11:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:10] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [11:58:10] T239643: Bugs in PHP port of LanguageConverter - https://phabricator.wikimedia.org/T239643 [11:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:33] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b346ebf]: Mirror html2html traffic to Parsoid/PHP - T229015 T239643 [11:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:55] jouncebot: refresh [11:59:56] I refreshed my knowledge about deployments. [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191203T1200). [12:00:04] Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:20] that was close [12:00:24] I can do swat today [12:00:44] (03CR) 10Ladsgroup: [C: 03+2] Set item term store for read new up to Q1000 for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554164 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [12:01:33] lol, well played Amir1 [12:01:49] :D [12:03:16] (03PS2) 10Ladsgroup: Set item term store for read new up to Q1000 for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554164 (https://phabricator.wikimedia.org/T225057) [12:03:34] (03CR) 10Ladsgroup: [C: 03+2] Set item term store for read new up to Q1000 for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554164 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [12:03:43] :D [12:04:27] (03Merged) 10jenkins-bot: Set item term store for read new up to Q1000 for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554164 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [12:09:31] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:554164|Set read new for term store for items for client wikis up to Q1000 (T225057)]] (duration: 01m 00s) [12:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:39] T225057: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225057 [12:09:50] !log EU SWAT is done [12:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:00] (03PS1) 10Muehlenhoff: Add CAS authentication to debmonitor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 [12:12:02] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b346ebf]: Mirror html2html traffic to Parsoid/PHP - T229015 T239643 (duration: 13m 29s) [12:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:08] T229015: Tracking: Direct live production traffic at Parsoid/PHP - 
https://phabricator.wikimedia.org/T229015 [12:12:09] T239643: Bugs in PHP port of LanguageConverter - https://phabricator.wikimedia.org/T239643 [12:12:38] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b346ebf]: Mirror html2html traffic to Parsoid/PHP, take #2 [12:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:46] PROBLEM - Maps - OSM synchronization lag - eqiad on icinga1001 is CRITICAL: 3.154e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [12:18:14] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:19:28] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:20:02] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:21:08] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:22:14] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={udp_localhost-err,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h& [12:22:14] r-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All 
[12:23:56] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b346ebf]: Mirror html2html traffic to Parsoid/PHP, take #2 (duration: 11m 17s) [12:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:55] (03CR) 10Volans: [C: 04-1] "Looks good in general, just one approach that is an anti-pattern, the re-read of the config that is already available via Django. See inli" (035 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 (owner: 10Muehlenhoff) [12:28:08] !log mobrovac@deploy1001 Started deploy [restbase/deploy@41bb230]: Log all html2html errors coming from Parsoid/PHP - T239643 [12:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:13] T239643: Bugs in PHP port of LanguageConverter - https://phabricator.wikimedia.org/T239643 [12:31:28] (03PS4) 10Phamhi: labmon: update graphite-web to be compatible with buster/stretch [puppet] - 10https://gerrit.wikimedia.org/r/554115 (https://phabricator.wikimedia.org/T224585) [12:31:47] (03CR) 10Phamhi: [V: 03+2] labmon: update graphite-web to be compatible with buster/stretch [puppet] - 10https://gerrit.wikimedia.org/r/554115 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [12:35:42] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:37:30] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:38:10] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6458 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [12:41:16] (03PS8) 10Jcrespo: bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 
(https://phabricator.wikimedia.org/T238048) [12:41:18] (03PS1) 10Jcrespo: bacula: Increase expected freshness of monthly full backups [puppet] - 10https://gerrit.wikimedia.org/r/554281 (https://phabricator.wikimedia.org/T234900) [12:42:32] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:42:49] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@41bb230]: Log all html2html errors coming from Parsoid/PHP - T239643 (duration: 14m 41s) [12:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:54] T239643: Bugs in PHP port of LanguageConverter - https://phabricator.wikimedia.org/T239643 [12:43:04] (03CR) 10jerkins-bot: [V: 04-1] bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [12:43:15] (03CR) 10Marostegui: bacula: Increase expected freshness of monthly full backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554281 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [12:44:40] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 647.6 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [12:45:50] (03PS2) 10Jcrespo: bacula: Increase expected freshness of monthly full backups [puppet] - 10https://gerrit.wikimedia.org/r/554281 (https://phabricator.wikimedia.org/T234900) [12:46:54] (03PS3) 10Jcrespo: bacula: Increase expected freshness of monthly full backups [puppet] - 
10https://gerrit.wikimedia.org/r/554281 (https://phabricator.wikimedia.org/T234900) [12:49:03] (03PS5) 10Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [12:49:21] (03CR) 10Jcrespo: [C: 03+2] bacula: Increase expected freshness of monthly full backups [puppet] - 10https://gerrit.wikimedia.org/r/554281 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [12:49:42] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:51:20] (03PS3) 10Alexandros Kosiaris: varnish: Add Go-http-client/2.0 to bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/554269 [12:52:23] (03CR) 10jerkins-bot: [V: 04-1] CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [12:56:50] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [12:59:50] (03CR) 10BBlack: [C: 03+1] varnish: Add Go-http-client/2.0 to bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/554269 (owner: 10Alexandros Kosiaris) [13:00:17] 10Operations, 10Puppet, 10Patch-For-Review, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10jbond) The CR to black is now working and you can see what changes are highlighted in this [[ https://integration.wikimedia.org/ci/job/operations-puppet-tests-s... [13:00:43] mobrovac: looks like a whole bunch of error messages coming in to the tune of 12k/s, related to your change I take it ? 
[13:00:55] mobrovac: mediawiki error messages that is [13:01:00] (03CR) 10Jbond: "thanks hashar" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [13:01:16] godog: [13:01:24] godog: that'd be weird, very weird [13:01:34] godog: can you inspect/ see them ? [13:01:57] mobrovac: yup doing that now [13:02:23] restbase logs are not overloaded [13:04:30] mobrovac: https://phabricator.wikimedia.org/P9800 [13:04:44] (03PS9) 10Jcrespo: bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) [13:04:46] looking [13:04:47] Notice: Undefined property: stdClass::$disabled" from parsoid-php looks like [13:05:29] ah yes godog, but 12k/s is a bit too much [13:05:51] I 100% agree mobrovac ! [13:06:09] haha [13:06:30] (03CR) 10jerkins-bot: [V: 04-1] bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [13:06:44] godog: i can deactivate this, but i enabled it in consensus with subbu & co, so that they can debug this specific aspect of parsoid/php [13:06:48] ah yeah this seems to be popular [13:06:50] ErrorException from line 143 of /srv/deployment/parsoid/deploy-cache/revs/743efb032da50284c64698c114b23c91f411825f/vendor/wikimedia/langconv/src/FST.php: PHP Notice: Trying to get property 'blen' of non-object" [13:07:30] godog: afaik, they should dpeloy fixes for these today [13:07:33] mobrovac: 12k logs/s is more than logstash can ingest atm though [13:07:39] i see [13:07:52] ok godog, i'll deactivate it then i guess for now [13:08:05] if there's a quick fix we can do to mute the error maybe? 
[13:08:12] 10Operations, 10Puppet, 10Patch-For-Review, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10ArielGlenn) Line length needs to be tweaked to conform with our puppet settings for flake8 I guess. [13:09:05] not that i know of [13:09:14] i'll just stop sending rb traffic to it [13:10:09] thank you, appreciate it, btw this is a reoccurrence of what happened yesterday with zhwiki but sans the 500s afaict [13:10:31] mobrovac, godog cscott worrked on some fixes, but we need to review it and deploy it .. and after that mirroring will be useful in helping him find any other issues and fixing them. [13:11:23] subbu: awesome! [13:11:37] if this is an iterative process we'll need to find a way to rate limit errors somehow [13:11:40] !log mobrovac@deploy1001 Started deploy [restbase/deploy@92acf1e]: Revert mirroring html2html traffic to PHP - T239643 [13:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:45] T239643: Bugs in PHP port of LanguageConverter - https://phabricator.wikimedia.org/T239643 [13:12:16] godog, yes indeed. but, scott thinks he got the bulk of the errors already. [13:13:15] neat [13:13:40] (03PS10) 10Jcrespo: bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) [13:16:25] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/19744/cp4022.ulsfo.wmnet/ says ok, merging" [puppet] - 10https://gerrit.wikimedia.org/r/554269 (owner: 10Alexandros Kosiaris) [13:16:43] thinking out loud, could we mirror a sample of traffic ? or a smaller sample if we're already doing sampling ? [13:17:26] godog the mirroring feature has that ability .. mobrovac does it in this specific case? 
[13:19:05] it was mirroring all html2html reqs given the idea was to get all errors [13:19:35] hehehe we got all errors alright [13:19:49] this is the graph I'm looking at btw https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=logging-eqiad&var-cluster=logstash&var-kafka_broker=All&var-disk_device=All&from=now-3h&to=now&fullscreen&panelId=46 [13:20:34] mamma mia [13:21:34] hehe mammamia indeed mobrovac [13:22:21] (03CR) 10Jcrespo: "@akosiaris let me know what you think: https://puppet-compiler.wmflabs.org/compiler1002/19745/backup1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [13:22:23] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@92acf1e]: Revert mirroring html2html traffic to PHP - T239643 (duration: 10m 43s) [13:22:25] mobrovac, once we've deployed all fixes for all current issues, mirror 10% and we can go from there. cscott will co-ordinate on this with you over phab / email since i don't think you two overlap on irc given hours. [13:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:28] T239643: Bugs in PHP port of LanguageConverter - https://phabricator.wikimedia.org/T239643 [13:22:52] ok subbu, godog, mirroring reverted, the rate should go back down now [13:23:05] ty [13:23:17] mobrovac: yup we're back, thank you! 
logstash is catching up [13:24:30] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0625 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [13:30:13] 10Operations, 10Traffic: Make DNS operations resilient against predictable failures - https://phabricator.wikimedia.org/T239711 (10BBlack) [13:32:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1112 after schema change', diff saved to https://phabricator.wikimedia.org/P9802 and previous config saved to /var/cache/conftool/dbconfig/20191203-133231-marostegui.json [13:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:26] (03PS1) 10Jcrespo: bacula: Setup new pool for databases as well as its configuration [puppet] - 10https://gerrit.wikimedia.org/r/554286 (https://phabricator.wikimedia.org/T238048) [13:34:43] (03PS1) 10Ema: varnish: fix order of parameters in tests helper script [puppet] - 10https://gerrit.wikimedia.org/r/554287 [13:35:14] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:36:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] varnish: fix order of parameters in tests helper script [puppet] - 10https://gerrit.wikimedia.org/r/554287 (owner: 10Ema) [13:36:25] (03CR) 10jerkins-bot: [V: 04-1] bacula: Setup new pool for databases as well as its configuration [puppet] - 10https://gerrit.wikimedia.org/r/554286 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [13:36:52] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:37:02] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:40:56] (03PS1) 10Jcrespo: 
database-backups: Use a different pool for dbatabase backups [puppet] - 10https://gerrit.wikimedia.org/r/554288 (https://phabricator.wikimedia.org/T238048) [13:42:16] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:42:22] (03PS1) 10Andrew Bogott: wmfkeystonehooks: designate domain orig project when creating default domain [puppet] - 10https://gerrit.wikimedia.org/r/554289 [13:42:32] (03Abandoned) 10Jcrespo: bacula: Setup separate pool and defaults for database backups on backup1001 [puppet] - 10https://gerrit.wikimedia.org/r/550671 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [13:43:31] (03PS2) 10Andrew Bogott: wmfkeystonehooks: designate domain orig project when creating default domain [puppet] - 10https://gerrit.wikimedia.org/r/554289 [13:44:55] (03PS2) 10Jcrespo: bacula: Setup new pool for databases as well as its configuration [puppet] - 10https://gerrit.wikimedia.org/r/554286 (https://phabricator.wikimedia.org/T238048) [13:45:07] (03PS1) 10BBlack: Switch to digicert-2019a in esams, temporarily [puppet] - 10https://gerrit.wikimedia.org/r/554291 (https://phabricator.wikimedia.org/T238494) [13:45:21] (03PS6) 10Jbond: backup::host: use fqdn_rand_string for password generation [puppet] - 10https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) [13:46:54] (03CR) 10jerkins-bot: [V: 04-1] bacula: Setup new pool for databases as well as its configuration [puppet] - 10https://gerrit.wikimedia.org/r/554286 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [13:47:26] (03PS1) 10BBlack: refactor authdns_servers to include host IPs [puppet] - 10https://gerrit.wikimedia.org/r/554292 (https://phabricator.wikimedia.org/T239711) [13:47:28] (03PS1) 10BBlack: authdns: add explicit hosts entries [puppet] - 10https://gerrit.wikimedia.org/r/554293 (https://phabricator.wikimedia.org/T239711) [13:47:30] (03PS1) 10BBlack: authdns: 
ferm for ssh hardcoded as well [puppet] - 10https://gerrit.wikimedia.org/r/554294 (https://phabricator.wikimedia.org/T239711) [13:48:15] (03PS3) 10Jcrespo: bacula: Setup new pool for databases as well as its configuration [puppet] - 10https://gerrit.wikimedia.org/r/554286 (https://phabricator.wikimedia.org/T238048) [13:49:13] (03CR) 10jerkins-bot: [V: 04-1] refactor authdns_servers to include host IPs [puppet] - 10https://gerrit.wikimedia.org/r/554292 (https://phabricator.wikimedia.org/T239711) (owner: 10BBlack) [13:50:45] (03PS2) 10Jcrespo: database-backups: Use a different pool for dbatabase backups [puppet] - 10https://gerrit.wikimedia.org/r/554288 (https://phabricator.wikimedia.org/T238048) [13:54:17] (03PS2) 10BBlack: refactor authdns_servers to include host IPs [puppet] - 10https://gerrit.wikimedia.org/r/554292 (https://phabricator.wikimedia.org/T239711) [13:54:21] (03PS2) 10BBlack: authdns: add explicit hosts entries [puppet] - 10https://gerrit.wikimedia.org/r/554293 (https://phabricator.wikimedia.org/T239711) [13:54:23] (03PS2) 10BBlack: authdns: ferm for ssh hardcoded as well [puppet] - 10https://gerrit.wikimedia.org/r/554294 (https://phabricator.wikimedia.org/T239711) [13:54:48] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:55:20] (03PS4) 10Jcrespo: bacula: Setup new pool for databases as well as its configuration [puppet] - 10https://gerrit.wikimedia.org/r/554286 (https://phabricator.wikimedia.org/T238048) [13:56:00] (03CR) 10jerkins-bot: [V: 04-1] refactor authdns_servers to include host IPs [puppet] - 10https://gerrit.wikimedia.org/r/554292 (https://phabricator.wikimedia.org/T239711) (owner: 10BBlack) [13:56:34] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:56:39] is there's some network stuff going on today? we've had multiple OSPF flaps.. 
[13:56:43] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10observability, and 2 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (10fgiunchedi) >>! In T189333#5645365, @EBernhardson wrote: >>>! In T189333#5488005, @Krinkle wrote: >>>>! In T189333#5483346, @fgiunchedi w... [13:56:47] !log cp-esams: disable puppet in preparation of digicert-2019a cert switch https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554291/ T238494 [13:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:52] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [13:57:43] (03CR) 10Ottomata: "Ok, stream.wm.org added in SAN in cert. I'll be deploying the change in just a bit." [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [13:59:07] (03CR) 10Ema: [C: 03+2] Switch to digicert-2019a in esams, temporarily [puppet] - 10https://gerrit.wikimedia.org/r/554291 (https://phabricator.wikimedia.org/T238494) (owner: 10BBlack) [13:59:30] (03PS3) 10BBlack: refactor authdns_servers to include host IPs [puppet] - 10https://gerrit.wikimedia.org/r/554292 (https://phabricator.wikimedia.org/T239711) [13:59:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] prometheus: Allow keeping FQDNs as targets [puppet] - 10https://gerrit.wikimedia.org/r/554059 (owner: 10Alexandros Kosiaris) [14:00:19] !log cp3050: depool and switch to digicert-2019a T238494 [14:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:06] (03CR) 10Muehlenhoff: "Thanks for the quick review!" 
(034 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 (owner: 10Muehlenhoff) [14:02:19] (03PS2) 10Muehlenhoff: Add CAS authentication to debmonitor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 [14:04:14] (03CR) 10jerkins-bot: [V: 04-1] Add CAS authentication to debmonitor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 (owner: 10Muehlenhoff) [14:04:51] (03PS3) 10BBlack: authdns: add explicit hosts entries [puppet] - 10https://gerrit.wikimedia.org/r/554293 (https://phabricator.wikimedia.org/T239711) [14:04:53] (03PS3) 10BBlack: authdns: ferm for ssh hardcoded as well [puppet] - 10https://gerrit.wikimedia.org/r/554294 (https://phabricator.wikimedia.org/T239711) [14:06:14] !log repool cp3050 with digicert-2019a T238494 [14:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:20] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [14:09:36] (03PS1) 10Ottomata: Add IPv6 calico rules for eventgate-logging-external -> kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/554295 (https://phabricator.wikimedia.org/T236386) [14:10:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] md: Globally set lower sync limits [puppet] - 10https://gerrit.wikimedia.org/r/549847 (https://phabricator.wikimedia.org/T237197) (owner: 10Alexandros Kosiaris) [14:10:30] (03PS3) 10Muehlenhoff: Add CAS authentication to debmonitor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 [14:10:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: Keep FQDNs for calico felix prometheus targets [puppet] - 10https://gerrit.wikimedia.org/r/554060 (owner: 10Alexandros Kosiaris) [14:13:37] !log cp-esams: re-enable puppet, switch to digicert-2019a certs https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554291/ T238494 [14:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:43] T238494: 200ms / 50% 
response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [14:14:32] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1002/19752/backup1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/554286 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [14:14:48] (03PS3) 10Jcrespo: database-backups: Use a different pool for dbatabase backups [puppet] - 10https://gerrit.wikimedia.org/r/554288 (https://phabricator.wikimedia.org/T238048) [14:16:48] (03PS4) 10Jcrespo: database-backups: Use a different pool for database backups [puppet] - 10https://gerrit.wikimedia.org/r/554288 (https://phabricator.wikimedia.org/T238048) [14:17:03] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [14:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:43] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:18:08] (03CR) 10Jcrespo: "It would be nice to have this deployed today, as database full backups are scheduled to happen tonight or tomorrow, I think." [puppet] - 10https://gerrit.wikimedia.org/r/554288 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [14:19:35] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . 
[14:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:47] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:23:33] whatt [14:23:36] again? [14:24:01] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:24:18] ah no this is api [14:24:20] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) [14:24:37] !log all cp-esams hosts switched to digicert-2019a certs T238494 [14:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:45] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [14:24:51] gilles: ^ [14:28:17] it is starting to drop on the graph I shared, let's see how low it goes [14:33:13] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:36:05] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:36:12] we seem to be having some link issues with eqiad<->esams as well [14:36:48] that BFD flap 
above is on the link to cr2-esams [14:37:21] and it seems like there have been multiple BFD flaps in various places this morning [14:37:31] well BFD or OSPF [14:38:09] but all seems esams-related [14:40:39] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:41:21] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:42:13] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:42:15] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:43:53] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:45:19] (03CR) 10Volans: "A question inline from the docs, one minor thing to add here and one for a puppet patch. Looks ok otherwise." 
(033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 (owner: 10Muehlenhoff) [14:51:05] (03PS4) 10BBlack: refactor authdns_servers to include host IPs [puppet] - 10https://gerrit.wikimedia.org/r/554292 (https://phabricator.wikimedia.org/T239711) [14:51:07] (03PS4) 10BBlack: authdns: add explicit hosts entries [puppet] - 10https://gerrit.wikimedia.org/r/554293 (https://phabricator.wikimedia.org/T239711) [14:51:09] (03PS4) 10BBlack: authdns: ferm for ssh hardcoded as well [puppet] - 10https://gerrit.wikimedia.org/r/554294 (https://phabricator.wikimedia.org/T239711) [14:52:00] !log swift eqiad-prod: final weight to ms-be105[7-9] - T237438 [14:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:06] T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 [14:52:59] (03PS4) 10BBlack: vcl: Bump TLSv1/TLSv1.1 pageview replacement to 10% [puppet] - 10https://gerrit.wikimedia.org/r/550869 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [14:54:06] (03CR) 10BBlack: [C: 03+2] vcl: Bump TLSv1/TLSv1.1 pageview replacement to 10% [puppet] - 10https://gerrit.wikimedia.org/r/550869 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [14:59:33] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) Confirmed that the latest patch 554115 fixed the graphite-web issue. 
[15:04:19] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [15:05:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/19758/ points out that this is probably ok, so +1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553423 (owner: 10Dzahn) [15:08:41] (03PS2) 10Alexandros Kosiaris: ATS: switch OTRS to use TLS and discovery record [puppet] - 10https://gerrit.wikimedia.org/r/553424 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [15:09:03] 10Operations, 10Wikimedia-Logstash, 10observability, 10Core Platform Team Legacy (Watching / External), 10Services (watching): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 (10fgiunchedi) We've been working with service owners to fix the obv... 
[15:09:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] ATS: switch OTRS to use TLS and discovery record [puppet] - 10https://gerrit.wikimedia.org/r/553424 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [15:10:08] (03PS1) 10Jbond: admin: add tests for system users [puppet] - 10https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686) [15:11:39] (03PS1) 10Filippo Giunchedi: logstash: bump max fields to 4096 [puppet] - 10https://gerrit.wikimedia.org/r/554303 (https://phabricator.wikimedia.org/T180051) [15:11:42] (03PS2) 10Jbond: admin: add tests for system users [puppet] - 10https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686) [15:12:40] (03PS3) 10Jbond: admin: add tests for system users [puppet] - 10https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686) [15:13:01] (03CR) 10Ema: [C: 04-1] Public cache routing for eventgate-logging-external (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:13:55] (03PS4) 10Jbond: admin: add tests for system users [puppet] - 10https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686) [15:14:57] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [15:14:58] (03CR) 10Ottomata: Public cache routing for eventgate-logging-external (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:15:16] (03PS5) 10Ottomata: Public cache routing for eventgate-logging-external [puppet] - 
10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) [15:15:41] (03CR) 10jerkins-bot: [V: 04-1] admin: add tests for system users [puppet] - 10https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686) (owner: 10Jbond) [15:15:53] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:16:41] (03CR) 10Herron: [C: 03+1] logstash: bump max fields to 4096 [puppet] - 10https://gerrit.wikimedia.org/r/554303 (https://phabricator.wikimedia.org/T180051) (owner: 10Filippo Giunchedi) [15:17:37] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:17:43] (03PS5) 10Jbond: admin: add tests for system users [puppet] - 10https://gerrit.wikimedia.org/r/554302 (https://phabricator.wikimedia.org/T239686) [15:17:49] (03PS6) 10Ottomata: Public cache routing for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) [15:18:41] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:20:19] !log executing sudo cumin -b6 -s 20 -p 95 'A:mw-api-eqiad' 'restart-php7.2-fpm' on cumin1001 [15:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:23] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:21:43] PROBLEM - Logstash Elasticsearch indexing errors on 
icinga1001 is CRITICAL: 0.5792 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:22:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add IPv6 calico rules for eventgate-logging-external -> kafka (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554295 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:24:03] (03PS2) 10Ottomata: Add IPv6 calico rules for eventgate-logging-external -> kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/554295 (https://phabricator.wikimedia.org/T236386) [15:24:17] (03CR) 10Ottomata: "Oops copypasta. Done." [deployment-charts] - 10https://gerrit.wikimedia.org/r/554295 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:24:35] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:24:57] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Buster - https://phabricator.wikimedia.org/T224585 (10MoritzMuehlenhoff) [15:25:09] (03PS2) 10Bstorm: toolforge: set a new package builder role [puppet] - 10https://gerrit.wikimedia.org/r/549211 [15:25:15] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.04167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:26:15] (03PS1) 10Jbond: admin::tests: update tests to use python3 [puppet] - 10https://gerrit.wikimedia.org/r/554305 [15:26:17] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:26:39] (03CR) 10Bstorm: "Updated it with Arturo's comments. In this case, with an eye to future package-builder changes and process review." [puppet] - 10https://gerrit.wikimedia.org/r/549211 (owner: 10Bstorm) [15:27:09] (03CR) 10Bstorm: toolforge: set a new package builder role (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/549211 (owner: 10Bstorm) [15:28:38] (03PS5) 10Ottomata: eventgate-analytics - stream config for new sparql-query streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/548764 (https://phabricator.wikimedia.org/T101013) [15:29:34] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics - stream config for new sparql-query streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/548764 (https://phabricator.wikimedia.org/T101013) (owner: 10Ottomata) [15:29:49] (03CR) 10Ema: [C: 03+1] Public cache routing for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:29:58] (03CR) 10jerkins-bot: [V: 04-1] admin::tests: update tests to use python3 [puppet] - 10https://gerrit.wikimedia.org/r/554305 (owner: 10Jbond) [15:30:37] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6167 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:30:45] (03CR) 10Bstorm: "Should be a good thing as far as dumps. We also use the puppetization for docker-registry in wmcs and puppetmaster. 
I don't think this wil" [puppet] - 10https://gerrit.wikimedia.org/r/551396 (https://phabricator.wikimedia.org/T238518) (owner: 10Vgutierrez) [15:31:38] (03CR) 10Ottomata: [C: 03+2] Public cache routing for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:31:57] (03PS2) 10Jbond: admin::tests: update tests to use python3 [puppet] - 10https://gerrit.wikimedia.org/r/554305 [15:32:23] (03PS11) 10Gehel: [wdqs] configure eventgate endpoint for sparql/query events [puppet] - 10https://gerrit.wikimedia.org/r/549081 (https://phabricator.wikimedia.org/T101013) (owner: 10DCausse) [15:33:31] (03CR) 10Gehel: [C: 03+2] [wdqs] configure eventgate endpoint for sparql/query events [puppet] - 10https://gerrit.wikimedia.org/r/549081 (https://phabricator.wikimedia.org/T101013) (owner: 10DCausse) [15:34:09] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.008333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:34:31] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:34:41] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . 
[15:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:57] (PS1) BBlack: switch all prod dns servers to buster [puppet] - https://gerrit.wikimedia.org/r/554311 (https://phabricator.wikimedia.org/T239667) [15:37:10] (CR) BBlack: [C: +2] switch all prod dns servers to buster [puppet] - https://gerrit.wikimedia.org/r/554311 (https://phabricator.wikimedia.org/T239667) (owner: BBlack) [15:41:39] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [15:43:50] (PS1) Ottomata: Rename director to eventgate_logging_external [puppet] - https://gerrit.wikimedia.org/r/554312 (https://phabricator.wikimedia.org/T236386) [15:44:12] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . 
[15:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:46] (CR) Ema: [C: +1] Rename director to eventgate_logging_external [puppet] - https://gerrit.wikimedia.org/r/554312 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [15:45:01] (CR) Ottomata: [C: +2] Rename director to eventgate_logging_external [puppet] - https://gerrit.wikimedia.org/r/554312 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [15:48:36] (PS1) Herron: logstash: create elk7 logstash collector profile [puppet] - https://gerrit.wikimedia.org/r/554314 (https://phabricator.wikimedia.org/T234854) [15:50:24] (CR) jerkins-bot: [V: -1] logstash: create elk7 logstash collector profile [puppet] - https://gerrit.wikimedia.org/r/554314 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [15:51:38] (PS2) Herron: logstash: create elk7 logstash collector profile [puppet] - https://gerrit.wikimedia.org/r/554314 (https://phabricator.wikimedia.org/T234854) [15:52:17] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): cloudstore1008 crash - Memory error - https://phabricator.wikimedia.org/T239569 (Bstorm) [15:53:45] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.8042 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:54:40] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [15:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:41] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:58:59] (03CR) 10BBlack: [C: 03+2] refactor authdns_servers to include host IPs [puppet] - 10https://gerrit.wikimedia.org/r/554292 (https://phabricator.wikimedia.org/T239711) (owner: 10BBlack) [15:59:42] (03PS5) 10BBlack: authdns: add explicit hosts entries [puppet] - 10https://gerrit.wikimedia.org/r/554293 (https://phabricator.wikimedia.org/T239711) [15:59:50] (03CR) 10BBlack: [C: 03+2] authdns: add explicit hosts entries [puppet] - 10https://gerrit.wikimedia.org/r/554293 (https://phabricator.wikimedia.org/T239711) (owner: 10BBlack) [16:00:09] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:00:12] James_F: Thanks for rolling out the MCR-undo fix, forgot to be around then. [16:00:44] Krinkle: No worries. 
[16:00:49] (PS1) Jbond: admin: add check to reject privlages which are to lack [puppet] - https://gerrit.wikimedia.org/r/554317 (https://phabricator.wikimedia.org/T239070) [16:01:57] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:02:01] (PS5) BBlack: authdns: ferm for ssh hardcoded as well [puppet] - https://gerrit.wikimedia.org/r/554294 (https://phabricator.wikimedia.org/T239711) [16:02:26] (CR) BBlack: [C: +2] authdns: ferm for ssh hardcoded as well [puppet] - https://gerrit.wikimedia.org/r/554294 (https://phabricator.wikimedia.org/T239711) (owner: BBlack) [16:02:27] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:02:31] (CR) jerkins-bot: [V: -1] admin: add check to reject privlages which are to lack [puppet] - https://gerrit.wikimedia.org/r/554317 (https://phabricator.wikimedia.org/T239070) (owner: Jbond) [16:03:35] (CR) Muehlenhoff: admin: add check to reject privlages which are to lack (2 comments) [puppet] - https://gerrit.wikimedia.org/r/554317 (https://phabricator.wikimedia.org/T239070) (owner: Jbond) [16:04:22] Operations, ops-eqiad, Cloud-Services, cloud-services-team (Kanban): cloudstore1008 crash - Memory error - https://phabricator.wikimedia.org/T239569 (Bstorm) [16:04:49] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [16:06:13] RECOVERY - Logstash Elasticsearch indexing 
errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0375 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:06:18] (03PS2) 10Jbond: admin: add check to reject privileges which are to permissive [puppet] - 10https://gerrit.wikimedia.org/r/554317 (https://phabricator.wikimedia.org/T239070) [16:06:34] (03CR) 10Jbond: "thanks updated" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/554317 (https://phabricator.wikimedia.org/T239070) (owner: 10Jbond) [16:07:41] (03PS1) 10Ottomata: Route all /produce/logging/* to eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/554318 (https://phabricator.wikimedia.org/T236386) [16:09:15] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:09:47] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6083 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:10:09] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:10:55] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:10:57] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:11:12] jouncebot: now [16:11:13] No deployments scheduled for the next 0 hour(s) and 48 minute(s) [16:11:23] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:11:31] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:12:39] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:14:29] jouncebot: next [16:14:30] In 0 hour(s) and 45 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191203T1700) [16:15:29] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m 
[16:16:41] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:16:54] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6375 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:19:59] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:20:15] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:23:31] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:24:23] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [16:25:33] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:26:37] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:27:01] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:27:31] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:28:21] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:28:31] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:28:43] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:30:53] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:30:53] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/MachineVision: Remove slow result randomization from the suggestions query (duration: 01m 03s) [16:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:03] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:32:59] RECOVERY - MegaRAID on ganeti2002 is OK: OK: optimal, 1 
logical, 4 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:34:19] (PS1) Ottomata: Import *sparql-query topics and refine them [puppet] - https://gerrit.wikimedia.org/r/554327 (https://phabricator.wikimedia.org/T101013) [16:34:23] Operations, Analytics-Kanban, Better Use Of Data, Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (Ottomata) Sigh, it turns out SRE wants us to not rewrite paths. So if we use path based routing, the app needs to handle wh... [16:34:31] (CR) Muehlenhoff: Add CAS authentication to debmonitor (3 comments) [software/debmonitor] - https://gerrit.wikimedia.org/r/554273 (owner: Muehlenhoff) [16:34:33] (PS4) Muehlenhoff: Add CAS authentication to debmonitor [software/debmonitor] - https://gerrit.wikimedia.org/r/554273 [16:35:06] (PS3) Mforns: Remove all references to Wikimetrics [puppet] - https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) [16:36:10] (CR) jerkins-bot: [V: -1] Import *sparql-query topics and refine them [puppet] - https://gerrit.wikimedia.org/r/554327 (https://phabricator.wikimedia.org/T101013) (owner: Ottomata) [16:36:11] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:39:16] (PS2) Ottomata: Import *sparql-query topics and refine them [puppet] - https://gerrit.wikimedia.org/r/554327 (https://phabricator.wikimedia.org/T101013) [16:42:32] (CR) Mforns: "As Wikimetrics server has been removed a couple days ago. 
There's no one using this puppet code any more, and we should be able to merge t" [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: 10Mforns) [16:42:39] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:42:51] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) How about: - events-logging.wikimedia.org/v1/events - events-analytics.wikimedia.org/v1/events ? Is events-logg... [16:43:51] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:46:07] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 122.3 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [16:46:07] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:46:30] 10Operations, 10Maps: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. 
- https://phabricator.wikimedia.org/T239728 (10Mathew.onipe) [16:46:42] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/19761/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/554327 (https://phabricator.wikimedia.org/T101013) (owner: 10Ottomata) [16:47:26] 10Operations, 10ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (10Papaul) @akosiaris after removing and putting the failed disk back the disk came back online with no errors ` sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does not include comp... [16:49:22] 10Operations, 10Puppet, 10Patch-For-Review, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10jbond) >>! In T239334#5708793, @ArielGlenn wrote: > Line length needs to be tweaked to conform with our puppet settings for flake8 I guess. I'm pretty sure the... [16:52:06] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns3002.wikimedia.org', 'dns5002.wikimedia.org'] ` The log can be fo... 
[16:52:20] !log reimaging dns3002 + dns5002 [16:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:29] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:54:29] PROBLEM - BFD status on cr1-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:54:29] PROBLEM - Host 2001:df2:e500:1:103:102:166:9 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={bird,pdnsrec} site={eqsin,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:55:17] ^bblack due to the reimage? [16:55:17] PROBLEM - Host 2620:0:862:1:91:198:174:62 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:33] (PS6) Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [16:56:03] PROBLEM - Recursive DNS on 103.102.166.9 is CRITICAL: Return code of 255 is out of bounds https://wikitech.wikimedia.org/wiki/DNS [16:56:38] I am not seeing traffic drops so I will assume yes [16:56:50] blerg [16:56:55] yes, reimage [16:57:05] it doesn't auto-downtime all the checks that aren't associated to the hostname, sorry [16:57:11] it is ok [16:57:19] will go ack all those :) [16:57:25] I was just worried it was a real issue [16:57:47] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6958 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:58:04] there should be a couple more for esams BFD too [16:58:10] and esams recursive [16:58:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is 
CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:59:01] PROBLEM - Recursive DNS on 91.198.174.62 is CRITICAL: Return code of 255 is out of bounds https://wikitech.wikimedia.org/wiki/DNS [16:59:02] (03CR) 10jerkins-bot: [V: 04-1] CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [16:59:13] ACKNOWLEDGEMENT - Recursive DNS on 103.102.166.9 is CRITICAL: Return code of 255 is out of bounds Brandon Black T239667 reimages https://wikitech.wikimedia.org/wiki/DNS [16:59:13] ACKNOWLEDGEMENT - Recursive DNS on 91.198.174.62 is CRITICAL: Return code of 255 is out of bounds Brandon Black T239667 reimages https://wikitech.wikimedia.org/wiki/DNS [16:59:13] ACKNOWLEDGEMENT - BFD status on cr1-eqsin is CRITICAL: CRIT: Down: 1 Brandon Black T239667 reimages https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:59:13] ACKNOWLEDGEMENT - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 Brandon Black T239667 reimages https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:59:34] it's a little troubling actually that esams BFD isn't showing anything yet [17:00:05] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191203T1700). Please do the needful. [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:06] I will let you debug on your own, given there is no ongoing issue [17:00:11] thanks! [17:00:18] (exceptions doesn't seem important) [17:00:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:00:26] (unrelated mw alert) [17:01:09] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 92 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:01:34] (03PS7) 10Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [17:03:33] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:04:57] PROBLEM - Host 2001:df2:e500:1:103:102:166:9 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:57] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0375 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:05:09] (03CR) 10jerkins-bot: [V: 04-1] CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [17:06:54] 10Operations, 10Puppet, 10Patch-For-Review, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10jbond) > Edit: Thinking again you probably meant that black needed to be updated to use 100 char line length (will update) > > Here is another update with line... 
[17:07:09] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:07:44] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@498c3d1]: repair bulk daemon swift listings [17:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:45] (03PS11) 10Jcrespo: bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) [17:09:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:10:17] ACKNOWLEDGEMENT - Host 2001:df2:e500:1:103:102:166:9 is DOWN: CRITICAL - Destination Unreachable (2001:df2:e500:1:103:102:166:9) Brandon Black T239667 reimages [17:10:17] ACKNOWLEDGEMENT - Host 2620:0:862:1:91:198:174:62 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:1:91:198:174:62) Brandon Black T239667 reimages [17:10:19] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.3 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:10:43] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [17:10:50] (03CR) 10Jcrespo: [C: 03+2] bacula: Rename schedule to monthlys., split schedule & jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/554257 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [17:11:07] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:12:39] (03PS5) 10Jcrespo: bacula: Setup new pool for databases as well as its configuration [puppet] - 10https://gerrit.wikimedia.org/r/554286 (https://phabricator.wikimedia.org/T238048) [17:13:34] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@498c3d1]: repair bulk daemon swift listings (duration: 05m 49s) [17:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:41] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:14:25] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [17:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:51] (03CR) 10Jcrespo: [C: 03+2] bacula: Setup new pool for databases as well as its configuration [puppet] - 10https://gerrit.wikimedia.org/r/554286 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [17:17:48] (03PS1) 10Jcrespo: Revert "bacula: Setup new pool for databases as well as its configuration" [puppet] - 10https://gerrit.wikimedia.org/r/554338 [17:17:57] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "bacula: Setup new pool for databases as 
well as its configuration" [puppet] - https://gerrit.wikimedia.org/r/554338 (owner: Jcrespo) [17:18:40] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:18:42] (PS1) Jcrespo: Revert "Revert "bacula: Setup new pool for databases as well as its configuration"" [puppet] - https://gerrit.wikimedia.org/r/554339 [17:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:51] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [17:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:40] (CR) Jcrespo: [C: -1] "Needs fixing:" [puppet] - https://gerrit.wikimedia.org/r/554339 (owner: Jcrespo) [17:21:00] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:22:01] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:53] (PS1) Tpt: enables the Wikisource extension in frwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/554341 (https://phabricator.wikimedia.org/T239731) [17:24:36] PROBLEM - Host 2620:0:862:1:b226:28ff:fe6e:df40 is DOWN: PING CRITICAL - Packet loss = 100% [17:27:07] (PS3) Herron: logstash: create elk7 logstash collector profile [puppet] - https://gerrit.wikimedia.org/r/554314 (https://phabricator.wikimedia.org/T234854) [17:28:22] Operations, Gerrit, vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (thcipriani) >>! In T239151#5707691, @Dzahn wrote: > Tried to create it but unfortunately: > > > ` > Failure: prerequisites not met for this operation: > error type: insufficient_resources,... 
[17:28:41] PROBLEM - Recursive DNS on 103.102.166.9 is CRITICAL: Return code of 255 is out of bounds https://wikitech.wikimedia.org/wiki/DNS [17:30:36] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:31:10] (03CR) 10Filippo Giunchedi: "LGTM overall, data in hieradata/ for input_kafka_consumer_id will need adjusting too, together with modules/profile/manifests/prometheus/o" [puppet] - 10https://gerrit.wikimedia.org/r/554314 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [17:31:48] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:32:24] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:32:36] PROBLEM - Host 2001:df2:e500:1:f6e9:d4ff:fed0:7870 is DOWN: PING CRITICAL - Packet loss = 100% [17:32:52] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:33:42] the reporting of temporary ipv6s to icinga for monitoring is annoying :P [17:33:56] but I assume there's a long-term plan to fix that with how we do ipv6 at install-time [17:36:12] PROBLEM - Host 2001:df2:e500:1:f6e9:d4ff:fed0:7870 is DOWN: PING CRITICAL - Packet loss = 100% [17:36:39] (03PS2) 10Jcrespo: Revert "Revert "bacula: Setup new pool for 
databases as well as its configuration"" [puppet] - 10https://gerrit.wikimedia.org/r/554339 [17:37:00] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:37:50] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:38:25] I think it's the ipv6 issue that causes pdns to fail to start the first time through as well [17:38:28] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:38:28] RECOVERY - BFD status on cr1-eqsin is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:38:36] (03CR) 10Jcrespo: [C: 03+2] Revert "Revert "bacula: Setup new pool for databases as well as its configuration"" [puppet] - 10https://gerrit.wikimedia.org/r/554339 (owner: 10Jcrespo) [17:38:46] 10Operations, 10ops-codfw: codfw: rack/setup/install puppetnaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (10Papaul) [17:38:54] (in that puppet reconfigures the ipv6 to the correct address during the same run it provides the old temporary address as $ipaddress6 to the pdns config, which then fails to start because it can't listen on that) [17:39:16] RECOVERY - Recursive DNS on 103.102.166.9 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [17:39:17] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` 
['dns3002.wikimedia.org', 'dns5002.wikimedia.org'] ` and were **ALL** successful. [17:39:19] 10Operations, 10ops-codfw: codfw: rack/setup/install puppetnaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (10Papaul) p:05Triage→03Normal [17:41:10] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) [17:41:37] (03PS5) 10Jcrespo: database-backups: Use a different pool for database backups [puppet] - 10https://gerrit.wikimedia.org/r/554288 (https://phabricator.wikimedia.org/T238048) [17:41:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:43:00] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:43:34] (03CR) 10Jcrespo: [C: 03+2] database-backups: Use a different pool for database backups [puppet] - 10https://gerrit.wikimedia.org/r/554288 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [17:44:36] 10Operations, 10ops-codfw: codfw: rack/setup/install puppetnaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (10Papaul) [17:45:06] RECOVERY - Recursive DNS on 91.198.174.62 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [17:51:49] (03PS1) 10Jcrespo: database-backups: Fix database jobdefaults for database dumps [puppet] - 10https://gerrit.wikimedia.org/r/554344 (https://phabricator.wikimedia.org/T238048) [17:51:56] RECOVERY - High 
average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:52:53] 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Papaul) p:05Triage→03Normal [17:53:25] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.04583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:53:53] (03CR) 10Jcrespo: [C: 03+2] database-backups: Fix database jobdefaults for database dumps [puppet] - 10https://gerrit.wikimedia.org/r/554344 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [17:55:49] (03PS4) 10Herron: logstash: create elk7 logstash collector profile [puppet] - 10https://gerrit.wikimedia.org/r/554314 (https://phabricator.wikimedia.org/T234854) [17:58:15] 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Papaul) @Jgreen I need some information for this server. 
- RAID information since we have 6x1.92TB disks - Server name: I am using frdb2002 since we already have frdb2001 Thanks [17:59:41] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:59:43] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5208 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191203T1800). [18:01:58] 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Papaul) [18:02:09] 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Papaul) [18:03:14] (03PS2) 10Jforrester: Enable the Wikisource extension on frwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554341 (https://phabricator.wikimedia.org/T239731) (owner: 10Tpt) [18:03:39] (03CR) 10Jforrester: [C: 03+1] "Should be good to go whenever." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554341 (https://phabricator.wikimedia.org/T239731) (owner: 10Tpt) [18:08:41] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) Yay! ` Full Backup 10 04-Dec-19 02:05 dbprov2002.codfw.wmnet-Monthly-1st-Wed-Databases-mysql-srv-backups-dumps-latest *unknown* Full... 
[18:09:14] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) [18:09:59] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:10:08] 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Papaul) [18:10:42] 10Operations, 10ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (10akosiaris) > Do you think we can keep using the server without replacing the disk or do we have to buy 1 disk and keep on site in case the disk failed It's probably gonna fail eventually. We can keep on using... [18:11:39] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/19763/" [puppet] - 10https://gerrit.wikimedia.org/r/554314 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [18:15:54] (03PS2) 10Cwhite: Revert "Revert "hiera: update ores to pass statsd through statsd_exporter"" [puppet] - 10https://gerrit.wikimedia.org/r/548705 (owner: 10Alexandros Kosiaris) [18:16:19] (03PS3) 10Cwhite: Revert "Revert "hiera: update ores to pass statsd through statsd_exporter"" [puppet] - 10https://gerrit.wikimedia.org/r/548705 (owner: 10Alexandros Kosiaris) [18:16:37] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 93 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:17:05] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:17:27] (03PS4) 10Cwhite: Revert "Revert "hiera: update ores to pass statsd through statsd_exporter"" [puppet] - 10https://gerrit.wikimedia.org/r/548705 (owner: 10Alexandros Kosiaris) [18:17:57] PROBLEM - High average GET latency for mw requests on 
api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:18:06] (03CR) 10Herron: "> LGTM overall, data in hieradata/ for input_kafka_consumer_id will" [puppet] - 10https://gerrit.wikimedia.org/r/554314 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [18:19:13] (03CR) 10Cwhite: [C: 03+2] Revert "Revert "hiera: update ores to pass statsd through statsd_exporter"" [puppet] - 10https://gerrit.wikimedia.org/r/548705 (owner: 10Alexandros Kosiaris) [18:21:31] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:21:53] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/554317 (https://phabricator.wikimedia.org/T239070) (owner: 10Jbond) [18:25:57] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:27:21] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:32:06] 10Operations, 10ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (10wiki_willy) [18:33:22] (03CR) 10Herron: [C: 03+2] 
logstash: create elk7 logstash collector profile [puppet] - 10https://gerrit.wikimedia.org/r/554314 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [18:36:32] 10Operations, 10ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (10wiki_willy) Per Papaul's request, we'll procure the replacement replacement drive ahead of time before it fails again via T239738 . Thanks, Willy [18:39:43] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 94 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:40:11] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:41:09] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:42:53] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:42:53] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:44:17] 10Operations, 10ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (10Papaul) 05Open→03Resolved We can resolve this task for now and re-open if that drive fails Thanks [18:44:39] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on 
icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:45:31] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:50:51] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:53:47] (03PS3) 10Gehel: [wdqs] enable asynchronous imports on wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/552836 (https://phabricator.wikimedia.org/T238045) (owner: 10DCausse) [18:56:13] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: CRITICAL: nf_conntrack is 92 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191203T1900). [19:00:05] thcipriani: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:44] looks like I'm the only scapee, I'll go ahead and deploy myself. [19:00:54] go ahead [19:01:09] thcipriani: go ahead and hand over to me once you're done, thanks! [19:01:21] Urbanecm: ack [19:01:33] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:01:49] what's with the conntrack issues on k8s? 
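[editor's note] The repeated "Check size of conntrack table" alerts above report how full the kernel's connection-tracking table is. A minimal sketch of that kind of check follows, assuming it compares the current entry count against the configured maximum. The thresholds and function names are hypothetical: the log only shows CRITICAL firing at 90 % and above, and the real check_conntrack plugin may work differently.

```python
from pathlib import Path

# Assumed thresholds: the log only shows CRITICAL at 90 % and above.
# The warning level is a hypothetical placeholder.
WARN_PCT = 80.0
CRIT_PCT = 90.0

# Standard Linux sysctl location for the netfilter conntrack counters.
PROC_DIR = Path("/proc/sys/net/netfilter")


def read_counter(name: str) -> int:
    """Read one integer counter, e.g. nf_conntrack_count or nf_conntrack_max."""
    return int((PROC_DIR / name).read_text().strip())


def conntrack_usage(count: int, maximum: int) -> float:
    """Return table usage as a percentage of the configured maximum."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return 100.0 * count / maximum


def classify(pct: float, warn: float = WARN_PCT, crit: float = CRIT_PCT) -> str:
    """Map a usage percentage to a Nagios-style status string."""
    if pct >= crit:
        return "CRITICAL"
    if pct >= warn:
        return "WARNING"
    return "OK"


def format_status(count: int, maximum: int) -> str:
    """Build a status line shaped like the alerts in this log."""
    pct = conntrack_usage(count, maximum)
    return f"{classify(pct)}: nf_conntrack is {pct:.0f} % full"
```

On a host with the conntrack module loaded, `format_status(read_counter("nf_conntrack_count"), read_counter("nf_conntrack_max"))` would produce a line of the same shape as the alerts from kubernetes1002 and kubernetes1004.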
[19:02:08] (03CR) 10Thcipriani: [C: 03+2] scap: prep and clean git ops for /srv/patches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526509 (https://phabricator.wikimedia.org/T222240) (owner: 10Thcipriani) [19:03:29] (03CR) 10Gehel: [C: 03+2] [wdqs] enable asynchronous imports on wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/552836 (https://phabricator.wikimedia.org/T238045) (owner: 10DCausse) [19:03:41] (03Merged) 10jenkins-bot: scap: prep and clean git ops for /srv/patches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526509 (https://phabricator.wikimedia.org/T222240) (owner: 10Thcipriani) [19:04:56] (03PS1) 10Urbanecm: Add partial blocks for scowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554349 (https://phabricator.wikimedia.org/T239493) [19:06:25] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:07:02] !log thcipriani@deploy1001 Synchronized scap/plugins: [[gerrit:526509|scap: prep and clean git ops for /srv/patches]] T222240 (no-op sync) (duration: 01m 01s) [19:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:09] T222240: scap leaving /srv/patches in a mess - https://phabricator.wikimedia.org/T222240 [19:07:12] Urbanecm: all done [19:07:17] thank you thcipriani [19:08:01] !log reimagine dns1002 + dns2002 - T239667 [19:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:07] T239667: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 [19:08:11] reimage really, but reimagine works too! 
[19:08:24] hopefully with better downtimes this go-round [19:08:25] (03CR) 10Urbanecm: [C: 03+2] Add partial blocks for scowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554349 (https://phabricator.wikimedia.org/T239493) (owner: 10Urbanecm) [19:08:49] (03PS2) 10Gehel: [wdqs] add async-import option [puppet] - 10https://gerrit.wikimedia.org/r/552835 (https://phabricator.wikimedia.org/T238045) (owner: 10DCausse) [19:09:21] (03Merged) 10jenkins-bot: Add partial blocks for scowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554349 (https://phabricator.wikimedia.org/T239493) (owner: 10Urbanecm) [19:09:24] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=dns2002.wikimedia.org [19:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:30] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=dns1002.wikimedia.org [19:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:59] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 92 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:10:13] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns2002.wikimedia.org', 'dns1002.wikimedia.org'] ` The log can be fo... [19:10:19] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2256.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912031909_dzahn_88754_mw225... 
[19:10:51] (03CR) 10Urbanecm: [C: 03+2] Create translation namespace on nap.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553929 (https://phabricator.wikimedia.org/T239547) (owner: 10DannyS712) [19:10:55] (03PS3) 10Urbanecm: Create translation namespace on nap.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553929 (https://phabricator.wikimedia.org/T239547) (owner: 10DannyS712) [19:11:00] (03CR) 10Urbanecm: [C: 03+2] Create translation namespace on nap.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553929 (https://phabricator.wikimedia.org/T239547) (owner: 10DannyS712) [19:11:22] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2257.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912031911_dzahn_89118_mw225... [19:11:33] (03CR) 10Gehel: [C: 03+2] [wdqs] add async-import option [puppet] - 10https://gerrit.wikimedia.org/r/552835 (https://phabricator.wikimedia.org/T238045) (owner: 10DCausse) [19:12:06] (03Merged) 10jenkins-bot: Create translation namespace on nap.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553929 (https://phabricator.wikimedia.org/T239547) (owner: 10DannyS712) [19:12:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdnsrec site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:12:24] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 45edf5a: Add partial blocks for scowiki (T239493) (duration: 01m 00s) [19:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:30] T239493: Deploy partial block for sco.wiki - https://phabricator.wikimedia.org/T239493 [19:12:53] PROBLEM - Host 2620:0:861:4:208:80:155:108 
is DOWN: PING CRITICAL - Packet loss = 100% [19:13:09] PROBLEM - Host 2620:0:860:4:208:80:153:111 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:09] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2258.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912031913_dzahn_89640_mw225... [19:14:23] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 5c83491: Create translation namespace on nap.wikisource (T239547) (duration: 01m 03s) [19:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:29] T239547: New Translation namespace for nap.wikisource - https://phabricator.wikimedia.org/T239547 [19:14:48] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2259.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912031914_dzahn_89825_mw225... 
[19:15:31] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5917 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:16:00] !log Morning SWAT done [19:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:05] sbassett: if you want to do T238768, SWAT was just done [19:18:26] (03PS2) 10Dzahn: varnish/ATS: rename director for OTRS from mendelevium to otrs [puppet] - 10https://gerrit.wikimedia.org/r/553423 [19:19:26] (03CR) 10Dzahn: [C: 03+2] varnish/ATS: rename director for OTRS from mendelevium to otrs [puppet] - 10https://gerrit.wikimedia.org/r/553423 (owner: 10Dzahn) [19:20:08] (03PS1) 10DCausse: [wdqs] fix async-import option [puppet] - 10https://gerrit.wikimedia.org/r/554351 [19:22:23] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [19:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:12] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10Mstyles) [19:24:30] (03CR) 10Dzahn: airflow: add a local mariadb server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/554215 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [19:24:37] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:56] (03PS3) 10Dzahn: airflow: add a local mariadb server [puppet] - 10https://gerrit.wikimedia.org/r/554215 (https://phabricator.wikimedia.org/T236180) [19:25:24] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10Gehel) as @Mstyles manager, I approve this access request [19:25:40] (03CR) 
10Gehel: [C: 03+2] [wdqs] fix async-import option [puppet] - 10https://gerrit.wikimedia.org/r/554351 (owner: 10DCausse) [19:25:44] (03PS1) 10Ottomata: Import sparql-query events in mediawiki_analytics camus job [puppet] - 10https://gerrit.wikimedia.org/r/554353 (https://phabricator.wikimedia.org/T101013) [19:26:33] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10Mstyles) email address: mstyles@wikimedia.org wikitech name: mstyles [19:27:34] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10Gehel) [19:28:19] (03CR) 10EBernhardson: [C: 03+1] Support /entity/ and other Wikidata URLs for Commons [puppet] - 10https://gerrit.wikimedia.org/r/526757 (https://phabricator.wikimedia.org/T222321) (owner: 10Smalyshev) [19:28:43] (03CR) 10Dzahn: airflow: move parameters, use lookup, style changes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [19:28:49] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [19:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:57] (03PS7) 10Dzahn: airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) [19:30:27] PROBLEM - Host 2620:0:861:4:d294:66ff:fe5f:6e82 is DOWN: PING CRITICAL - Packet loss = 100% [19:30:58] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:05] ^ more bad ipv6 defs [19:31:41] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:02] (03PS3) 10Dzahn: phabricator: remove wstunnel for 
aphlict from webserver config [puppet] - 10https://gerrit.wikimedia.org/r/554219 (https://phabricator.wikimedia.org/T238593) [19:32:15] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:33:41] (03PS2) 10Dzahn: design.wikimedia.org: Redirect /compontents to /components/links.html [puppet] - 10https://gerrit.wikimedia.org/r/554231 (https://phabricator.wikimedia.org/T239681) (owner: 10VolkerE) [19:33:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:44] (03CR) 10Volans: "Thanks for the fixes! One typo and one nit between docs and code and we're good to go. An optional suggestion inline too." 
(033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 (owner: 10Muehlenhoff) [19:35:20] (03PS3) 10VolkerE: design.wikimedia.org: Redirect /components to /components/links.html [puppet] - 10https://gerrit.wikimedia.org/r/554231 (https://phabricator.wikimedia.org/T239681) [19:35:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:43] (03PS2) 10Ottomata: Import sparql-query events in mediawiki_analytics camus job [puppet] - 10https://gerrit.wikimedia.org/r/554353 (https://phabricator.wikimedia.org/T101013) [19:36:05] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Import sparql-query events in mediawiki_analytics camus job [puppet] - 10https://gerrit.wikimedia.org/r/554353 (https://phabricator.wikimedia.org/T101013) (owner: 10Ottomata) [19:36:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:49] (03PS1) 10EBernhardson: analaytics meta db: explicit_defaults_for_timestamp=on [puppet] - 10https://gerrit.wikimedia.org/r/554354 (https://phabricator.wikimedia.org/T236180) [19:38:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:44] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:38:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:18] (03PS2) 10EBernhardson: analytics meta db: explicit_defaults_for_timestamp=on [puppet] - 10https://gerrit.wikimedia.org/r/554354 (https://phabricator.wikimedia.org/T236180) [19:39:34] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Convert DNS servers to 
Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns1002.wikimedia.org', 'dns2002.wikimedia.org'] ` and were **ALL** successful. [19:41:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:08] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:42:13] (03CR) 10EBernhardson: [C: 03+1] airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [19:42:21] (03PS1) 10Herron: logstash: remove non-kafka inputs from elk7 cluster [puppet] - 10https://gerrit.wikimedia.org/r/554355 (https://phabricator.wikimedia.org/T234854) [19:44:18] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:45:43] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1003/19764/" [puppet] - 10https://gerrit.wikimedia.org/r/554355 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [19:46:35] (03CR) 10Herron: [C: 03+2] logstash: remove non-kafka inputs from elk7 cluster [puppet] - 10https://gerrit.wikimedia.org/r/554355 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [19:49:10] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 92 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:49:12] Urbanecm: tx, will deploy soon [19:49:50] (03PS1) 10Ottomata: Use mediawiki_analytics_events, no mediawiki_events camus job for sparql-query [puppet] - 10https://gerrit.wikimedia.org/r/554357 (https://phabricator.wikimedia.org/T101013) [19:52:07] (03PS4) 10Mstyles: add mstyles as user [puppet] 
- 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) [19:52:27] (03CR) 10jerkins-bot: [V: 04-1] add mstyles as user [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [19:52:39] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10Mstyles) [19:54:37] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@c21a1ca]: Bump preq version for better logging around MW API timeouts [19:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:55] (03CR) 10Dzahn: add mstyles as user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [19:57:52] (03PS5) 10Dzahn: admins: add shell account for mstyles [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [19:58:52] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:59:48] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:00:22] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@c21a1ca]: Bump preq version for better logging around MW API timeouts (duration: 05m 46s) [20:00:23] (03CR) 10Dzahn: [C: 03+1] "Hi Maryum, i fixed the rebase issue and adjusted the UID. The group memberships and key are matching the ticket. lgtm." 
[puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [20:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:14] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10Gehel) [20:01:16] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 92 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:02:30] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Add Maryum to Puppet - https://phabricator.wikimedia.org/T239300 (10Gehel) as the manager of the search platform team, and @Mstyles' manager, I approve this request. Since this touches analytics related groups, it probably needs approval f... [20:04:30] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Add Maryum to Puppet - https://phabricator.wikimedia.org/T239300 (10Dzahn) This ticket seems to be a duplicate of T239654. Adding her there as well. Linking Gerrit change i amended to. 
[20:04:40] (03CR) 10Ottomata: [C: 03+2] Use mediawiki_analytics_events, no mediawiki_events camus job for sparql-query [puppet] - 10https://gerrit.wikimedia.org/r/554357 (https://phabricator.wikimedia.org/T101013) (owner: 10Ottomata) [20:04:50] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10Dzahn) +@Nuria per T239300#5710194 [20:04:50] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 78 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:05:43] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10Dzahn) [20:05:48] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Add Maryum to Puppet - https://phabricator.wikimedia.org/T239300 (10Dzahn) [20:06:38] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Add Maryum to Puppet - https://phabricator.wikimedia.org/T239300 (10Dzahn) > I didn't add myself to the users because I wasn't sure what my uid was Your UID is 22524. I adjusted your Gerrit change accordingly. [20:07:17] (03PS6) 10Dzahn: admins: add shell account for mstyles [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [20:08:23] (03CR) 10Dzahn: "could you merge https://gerrit.wikimedia.org/r/c/design/style-guide/+/554356 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/554231 (https://phabricator.wikimedia.org/T239681) (owner: 10VolkerE) [20:08:37] (03PS7) 10Mstyles: add mstyles as user [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) [20:09:30] (03CR) 10Mstyles: "> Patch Set 5: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [20:10:23] (03CR) 10Dzahn: [C: 03+2] rsync: readd incoming and outgoing chmod [puppet] - 10https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [20:11:32] (03CR) 10Dzahn: "> thanks! I just uploaded something, hope that doesn't conflict. I missed this message before I did the upload" [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [20:13:06] (03PS8) 10Mstyles: add mstyles as user [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) [20:13:48] (03CR) 10Mstyles: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [20:15:30] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.06667 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:17:24] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=dns[12]002.wikimedia.org [20:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:31] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2256.codfw.wmnet'] ` and were **ALL** successful. 
[20:23:57] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 93 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:24:01] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2257.codfw.wmnet'] ` and were **ALL** successful. [20:26:44] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2258.codfw.wmnet'] ` and were **ALL** successful. [20:28:14] 10Operations, 10Traffic, 10hardware-requests: Add 10G NICs to core site DNS servers (6 servers, 3 per site) - https://phabricator.wikimedia.org/T239675 (10faidon) [20:28:26] 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10Jclark-ctr) [20:29:14] (03PS1) 10Cwhite: Revert "hiera: update ores to pass statsd through statsd_exporter" [puppet] - 10https://gerrit.wikimedia.org/r/554360 [20:29:38] (03CR) 10Cwhite: [C: 03+2] Revert "hiera: update ores to pass statsd through statsd_exporter" [puppet] - 10https://gerrit.wikimedia.org/r/554360 (owner: 10Cwhite) [20:32:32] (03PS1) 10Herron: logstash: add kafka ssl_endpoint_identification_algorithm param [puppet] - 10https://gerrit.wikimedia.org/r/554362 (https://phabricator.wikimedia.org/T234854) [20:33:55] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:34:20] (03CR) 10jerkins-bot: [V: 04-1] logstash: add kafka ssl_endpoint_identification_algorithm param [puppet] - 10https://gerrit.wikimedia.org/r/554362 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [20:35:42] (03PS2) 10Herron: logstash: add kafka ssl_endpoint_identification_algorithm param [puppet] - 
10https://gerrit.wikimedia.org/r/554362 (https://phabricator.wikimedia.org/T234854) [20:36:47] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:38:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2258.codfw.wmnet [20:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:37] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2256.codfw.wmnet [20:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:57] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2257.codfw.wmnet [20:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:53] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6292 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:41:09] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:42:41] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:42:57] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:43:26] (03PS3) 10Herron: logstash: add kafka ssl_endpoint_identification_algorithm param [puppet] - 10https://gerrit.wikimedia.org/r/554362 (https://phabricator.wikimedia.org/T234854) [20:43:32] !log mw2259 - did not come back from 
reboot after reimage, also mgmt not reachable (T239054) [20:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:37] T239054: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 [20:46:23] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1001/19766/" [puppet] - 10https://gerrit.wikimedia.org/r/554362 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [20:47:41] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2259.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2259.codfw.wmnet'] ` [20:47:56] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.625 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:48:36] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:50:13] (03CR) 10Herron: [C: 03+2] logstash: add kafka ssl_endpoint_identification_algorithm param [puppet] - 10https://gerrit.wikimedia.org/r/554362 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [20:51:50] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:54:08] Hey all - was going to deploy T238768 security patch to wmf.5 and wmf.8 now [20:55:03] sbassett: Go for it. 
[20:55:10] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 92 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:55:34] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.85 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:55:37] James_F: will do, thanks [20:59:20] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.06667 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [21:00:10] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:01:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:02:58] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:03:03] !log Deployed security patch for T238768 to wmf.5 [21:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:20] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7833 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [21:06:37] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10Nuria) +1 on my end @Mstyles Please read https://wikitech.wikimedia.org/wiki/Analytics/Data_Access_Guidelines She will also need to be part of wmf... [21:07:14] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:09:20] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [21:09:24] !log Deployed security patch for T238768 to wmf.8 [21:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:58] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:14:33] (03PS3) 10Tpt: Enable the Wikisource extension on frwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554341 (https://phabricator.wikimedia.org/T239731) [21:15:58] PROBLEM - High average GET latency for mw requests on api_appserver in 
eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [21:18:42] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:19:28] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:31:10] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:34:00] PROBLEM - Host mw2259 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:43] ACKNOWLEDGEMENT - Host mw2259 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T239758 [21:35:33] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2259 and mgmt down - https://phabricator.wikimedia.org/T239758 (10Dzahn) [21:36:15] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2259 and mgmt down - https://phabricator.wikimedia.org/T239758 (10Dzahn) @Papaul Could you take a look at this onsite? [21:36:37] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2259 and mgmt down - https://phabricator.wikimedia.org/T239758 (10Dzahn) p:05Triage→03Normal [21:37:28] In Phabricator the normal priority is now "Medium". It has been renamed. But wikibugs still uses the old term "Normal". 
[21:38:18] !log volker-e@deploy1001 Started deploy [design/style-guide@02a92f7]: Deploy design/style-guide: [21:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:26] !log volker-e@deploy1001 Finished deploy [design/style-guide@02a92f7]: Deploy design/style-guide: (duration: 00m 07s) [21:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:46] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [21:39:06] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [21:39:26] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [21:40:59] (03CR) 10Dzahn: [C: 04-1] "not needed anymore. 
already done in https://github.com/wikimedia/WikimediaUI-Style-Guide/pull/312" [puppet] - 10https://gerrit.wikimedia.org/r/554231 (https://phabricator.wikimedia.org/T239681) (owner: 10VolkerE) [21:41:36] 10Operations, 10Wikimedia Design Style Guide: Temporarily forward /design-style-guide/components/ to /design-style-guide/components/links.html - https://phabricator.wikimedia.org/T239681 (10Volker_E) p:05Triage→03Normal [21:42:00] 10Operations, 10Wikimedia Design Style Guide: Temporarily forward /design-style-guide/components/ to /design-style-guide/components/links.html - https://phabricator.wikimedia.org/T239681 (10Volker_E) 05Open→03Resolved Fixed in https://github.com/wikimedia/WikimediaUI-Style-Guide/pull/312 [21:42:40] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: HTTP OK: HTTP/1.1 200 OK - 28447 bytes in 6.908 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [21:42:54] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28421 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [21:44:00] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 28534 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [21:52:58] (03PS3) 10Jforrester: Pin XML dump schema version at 0.10 for now. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552565 (https://phabricator.wikimedia.org/T238921) (owner: 10Daniel Kinzler) [21:53:12] (03CR) 10Jforrester: [C: 03+2] Pin XML dump schema version at 0.10 for now. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552565 (https://phabricator.wikimedia.org/T238921) (owner: 10Daniel Kinzler) [21:54:36] (03Merged) 10jenkins-bot: Pin XML dump schema version at 0.10 for now. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/552565 (https://phabricator.wikimedia.org/T238921) (owner: 10Daniel Kinzler) [21:58:26] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2259 and mgmt down - https://phabricator.wikimedia.org/T239758 (10Dzahn) [21:59:56] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2259 and mgmt down - https://phabricator.wikimedia.org/T239758 (10Dzahn) Also: I noticed in Icinga there is no "mw2259.mgmt" (while for example mw2247.mgmt exists). It's simply not there: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_str... [22:02:07] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10Dzahn) [22:03:07] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set wgXmlDumpSchemaVersion to 0.1.0 everywhere T238921 T174031 (duration: 01m 03s) [22:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:14] T238921: MCR: Include all slots in XML dumps per default - https://phabricator.wikimedia.org/T238921 [22:03:15] T174031: MCR: Include all slots in XML dumps - https://phabricator.wikimedia.org/T174031 [22:07:46] (03PS8) 10Dzahn: airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) [22:09:59] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10JHedden) 05Open→03Resolved The drives in slots 32:8 and 32:9 are marked as a hot spare now. ` # Check that the drive is unconfigured cloudvirt1... 
[22:10:01] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10JHedden) [22:10:15] (03PS3) 10Jforrester: Turn off redirect on exact search match for Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553510 (https://phabricator.wikimedia.org/T235263) (owner: 10Cparle) [22:10:17] (03PS1) 10Jforrester: Turn off redirect on exact search match for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554372 (https://phabricator.wikimedia.org/T235263) [22:10:56] (03CR) 10jerkins-bot: [V: 04-1] Turn off redirect on exact search match for Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553510 (https://phabricator.wikimedia.org/T235263) (owner: 10Cparle) [22:11:07] (03CR) 10Dzahn: [C: 03+2] airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [22:11:45] (03CR) 10jerkins-bot: [V: 04-1] Turn off redirect on exact search match for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554372 (https://phabricator.wikimedia.org/T235263) (owner: 10Jforrester) [22:14:13] (03PS4) 10Jforrester: Turn off redirect on exact search match for Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553510 (https://phabricator.wikimedia.org/T235263) (owner: 10Cparle) [22:14:15] (03PS2) 10Jforrester: Turn off redirect on exact search match for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554372 (https://phabricator.wikimedia.org/T235263) [22:15:50] (03CR) 10Jforrester: [C: 03+2] Turn off redirect on exact search match for Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553510 (https://phabricator.wikimedia.org/T235263) (owner: 10Cparle) [22:16:10] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikitech-static [22:16:11] (03PS1) 10Dzahn: airflow: use correct data type for config file name [puppet] - 10https://gerrit.wikimedia.org/r/554373 [22:16:32] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [22:16:41] (03CR) 10Dzahn: [V: 03+2] airflow: use correct data type for config file name [puppet] - 10https://gerrit.wikimedia.org/r/554373 (owner: 10Dzahn) [22:16:48] (03Merged) 10jenkins-bot: Turn off redirect on exact search match for Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553510 (https://phabricator.wikimedia.org/T235263) (owner: 10Cparle) [22:16:50] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [22:16:50] (03CR) 10Dzahn: [C: 03+2] airflow: use correct data type for config file name [puppet] - 10https://gerrit.wikimedia.org/r/554373 (owner: 10Dzahn) [22:17:43] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2259 down and mgmt does not exist? 
- https://phabricator.wikimedia.org/T239758 (10Dzahn) [22:20:19] (03CR) 10Dzahn: "one follow-up and now it works without changes:" [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [22:20:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Jclark-ctr) resent tsr report again waiting on dell [22:21:23] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set wmgDoNotRedirectOnSearchMatch, default off, on for Test Commons T235263 (duration: 01m 01s) [22:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:29] T235263: Make it possible to bypass automatic redirection to exact matches in commons - https://phabricator.wikimedia.org/T235263 [22:21:32] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5833 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [22:22:02] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28431 bytes in 1.623 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:22:10] (03Abandoned) 10Dzahn: design.wikimedia.org: Redirect /components to /components/links.html [puppet] - 10https://gerrit.wikimedia.org/r/554231 (https://phabricator.wikimedia.org/T239681) (owner: 10VolkerE) [22:22:55] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Read wmgDoNotRedirectOnSearchMatch to decide to enable auto-redirect search result change T235263 (duration: 01m 00s) [22:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:08] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 
28538 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:23:30] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: HTTP OK: HTTP/1.1 200 OK - 28535 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:23:36] (03CR) 10Dzahn: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [22:24:23] (03CR) 10Mstyles: "> Patch Set 8: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [22:24:55] (03CR) 10Dzahn: [C: 03+1] "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [22:26:12] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:26:37] (03PS9) 10Mstyles: admin: add mstyles as user [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) [22:26:56] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7458 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [22:27:03] (03CR) 10Mstyles: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [22:27:40] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:28:59] (03PS4) 10Jforrester: Enable the Wikisource extension on frwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554341 (https://phabricator.wikimedia.org/T239731) (owner: 10Tpt) [22:29:04] (03CR) 10Jforrester: [C: 03+2] Enable the Wikisource extension on frwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554341 (https://phabricator.wikimedia.org/T239731) (owner: 10Tpt) [22:29:28] (03CR) 1020after4: [C: 03+1] phabricator: remove wstunnel for aphlict from webserver config [puppet] - 10https://gerrit.wikimedia.org/r/554219 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [22:30:42] (03Merged) 10jenkins-bot: Enable the Wikisource extension on frwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554341 (https://phabricator.wikimedia.org/T239731) (owner: 10Tpt) [22:31:34] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:31:36] (03PS4) 10Jforrester: Remove `wgImportSources` settings for closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552361 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [22:33:00] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:24] (03CR) 10Jforrester: [C: 03+2] Remove `wgImportSources` settings for closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552361 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [22:34:12] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable the Wikisource extension on frwikisource T239731 (duration: 01m 00s) [22:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:18] T239731: Deploy Wikisource extension to frwikisource - 
https://phabricator.wikimedia.org/T239731 [22:34:30] !log disabled temporarily icinga meta-monitoring (disk full on the wikitech-static host) [22:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:37] (03Merged) 10jenkins-bot: Remove `wgImportSources` settings for closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552361 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [22:35:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1002: SMART/disk error - https://phabricator.wikimedia.org/T230088 (10Jclark-ctr) Confirmed: Service Request 1005040764 was successfully submitted. ordered replacement drive [22:36:12] RECOVERY - Wikitech and wt-static content in sync on labweb1001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (136828 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [22:36:12] RECOVERY - Wikitech and wt-static content in sync on labweb1002 is OK: wikitech-static OK - wikitech and wikitech-static in sync (136828 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [22:36:33] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove settings for closed wikis T231178 (duration: 01m 01s) [22:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:39] T231178: General cleanup of initialize settings - https://phabricator.wikimedia.org/T231178 [22:43:07] (I've hung up the conch.) 
[22:43:10] Operations, Traffic, hardware-requests: Add 10G NICs to core site DNS servers (6 servers, 3 per site) - https://phabricator.wikimedia.org/T239675 (RobH)
[22:43:13] Operations, Traffic, hardware-requests: Add 10G NICs to core site DNS servers (6 servers, 3 per site) - https://phabricator.wikimedia.org/T239675 (RobH)
[22:48:30] Operations, Traffic, hardware-requests: Add 10G NICs to core site DNS servers (6 servers, 3 per site) - https://phabricator.wikimedia.org/T239675 (RobH) Open→Resolved Please note sub-tasks have been created in the private S4 #procurement space, and quotes requested from Dell for these hosts....
[22:52:54] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[22:53:16] (PS2) 20after4: varnish: switch phabricator backend to phab1001 [puppet] - https://gerrit.wikimedia.org/r/552595 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[22:54:47] Operations, ops-eqiad, Analytics: analytics1057's BBU is faulty - https://phabricator.wikimedia.org/T239045 (Jclark-ctr) @elukey unsure if this is same bbu it is different models. 720xd vs 730xd
[22:55:26] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[22:59:07] !log apt-get dist-upgrade and reboot of wikitech-static host
[22:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:31] Operations, SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494 (RStallman-legalteam) The NDA is fully signed and on file now. Many thanks!
[23:14:56] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5792 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[23:17:50] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 79 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[23:24:54] RECOVERY - Wikitech and wt-static content in sync on cloudweb2001-dev is OK: wikitech-static OK - wikitech and wikitech-static in sync (76298 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[23:29:08] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[23:37:18] RECOVERY - Check size of conntrack table on kubernetes1002 is OK: OK: nf_conntrack is 77 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[23:42:44] Operations, WMF-Legal, serviceops, Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (JbuattiWMF) Hi @Aklapper, just to confirm what Prateek said, we would need help with both but #1 is more immediate because we...
[23:47:31] !log re-enabled meta-monitoring crontabs on wikitech-static after cleanup, reboot and fix wikitech-static's import errors
[23:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log