[00:12:04] Operations, Wikimedia-Mailing-lists: New WikiJournal_CoC@lists.wikimedia.org - https://phabricator.wikimedia.org/T223861 (Aklapper) Open→Resolved Unfortunately no reply by @Thomas_Shafee, hence assuming everything is fine.
[00:21:02] Operations, observability, User-CDanis: Find links to grafana.wikimedia.org and change them to use the new URL format - https://phabricator.wikimedia.org/T211982 (Aklapper)
[00:24:05] Operations, observability, User-CDanis: Find links to grafana.wikimedia.org and change them to use the new URL format - https://phabricator.wikimedia.org/T211982 (Aklapper)
[01:31:41] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[01:32:59] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 75918 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[05:01:07] Operations, SRE-Access-Requests: Requesting access to deployment cluster for awight - https://phabricator.wikimedia.org/T225062 (greg) Approved on my side. Thanks Adam!
[05:02:26] (PS1) ArielGlenn: revoke dzahn's ssh keys [puppet] - https://gerrit.wikimedia.org/r/515988 (https://phabricator.wikimedia.org/T225371)
[05:03:23] (CR) jerkins-bot: [V: -1] revoke dzahn's ssh keys [puppet] - https://gerrit.wikimedia.org/r/515988 (https://phabricator.wikimedia.org/T225371) (owner: ArielGlenn)
[05:05:19] (PS2) ArielGlenn: revoke dzahn's ssh keys [puppet] - https://gerrit.wikimedia.org/r/515988 (https://phabricator.wikimedia.org/T225371)
[05:06:39] (CR) ArielGlenn: [C: +2] revoke dzahn's ssh keys [puppet] - https://gerrit.wikimedia.org/r/515988 (https://phabricator.wikimedia.org/T225371) (owner: ArielGlenn)
[05:21:05] PROBLEM - puppet last run on mwmaint1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): User[dzahn]
[05:32:44] ^^ handled
[05:48:19] RECOVERY - puppet last run on mwmaint1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:29:27] PROBLEM - puppet last run on restbase1023 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:30:39] PROBLEM - puppet last run on logstash1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:33:35] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/safe-service-restart]
[06:38:07] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/cron - 177 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[06:45:25] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[06:56:41] RECOVERY - puppet last run on restbase1023 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:57:57] RECOVERY - puppet last run on logstash1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:47] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[08:32:35] RECOVERY - Host lvs4007 is UP: PING OK - Packet loss = 0%, RTA = 74.31 ms
[09:26:51] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[09:29:43] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman
[10:40:36] (PS1) Andrew Bogott: Removed dzahn's root keys [labs/private] - https://gerrit.wikimedia.org/r/516049 (https://phabricator.wikimedia.org/T225371)
[10:41:24] (CR) Andrew Bogott: [V: +2 C: +2] Removed dzahn's root keys [labs/private] - https://gerrit.wikimedia.org/r/516049 (https://phabricator.wikimedia.org/T225371) (owner: Andrew Bogott)
[11:41:45] RECOVERY - MariaDB Slave Lag: s6 on db2097 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[11:52:40] (PS1) Gergő Tisza: Fix import group name [mediawiki-config] - https://gerrit.wikimedia.org/r/516053
[12:06:34] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): WDQS internal servers started lagging behind - https://phabricator.wikimedia.org/T224829 (Mathew.onipe) Open→Invalid
[14:36:54] Operations, ops-codfw, DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (Marostegui) Looks like it might be related to a HW memory issue that has been going on for a few days: ` /system1/log1/record16 Targets...
[14:40:23] Operations, DNS, Matrix, Traffic, Wikimedia-Apache-configuration: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (Tgr)
[14:56:42] (PS1) Gergő Tisza: Add .well-known/matrix for wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/516055 (https://phabricator.wikimedia.org/T223835)
[14:58:44] (PS1) Gergő Tisza: Revert "Matrix wikimedia.org IDs domain authorization" [dns] - https://gerrit.wikimedia.org/r/516056 (https://phabricator.wikimedia.org/T223835)
[15:03:34] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:01:24] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational
[16:11:29] (CR) Lucas Werkmeister (WMDE): dologmsg: move this little script out of toolforge profile (2 comments) [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm)
[16:48:29] Operations, Continuous-Integration-Infrastructure, MediaWiki-Core-Testing, HHVM: Re-add complete URL parsing fix from 3.18.7 release - https://phabricator.wikimedia.org/T185024 (Krinkle)
[19:02:25] Operations, Prod-Kubernetes, Release Pipeline, Documentation, Release-Engineering-Team (Next): TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T213090 (thcipriani)
[19:02:30] Operations, Prod-Kubernetes, Documentation, Release Pipeline (Blubber), Release-Engineering-Team (Next): Update Blubber documentation - https://phabricator.wikimedia.org/T213198 (thcipriani) Open→Resolved a:thcipriani >>! In T213198#4961595, @LarsWirzenius wrote: > https://wikitec...
[19:36:57] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[19:58:09] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[20:04:05] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[20:30:47] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[20:42:45] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[20:44:05] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[22:00:35] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[22:27:45] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[22:43:07] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[22:47:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[22:56:07] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received: / (spec from root) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[23:01:34] PROBLEM - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[23:01:49] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[23:06:15] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[23:08:38] RECOVERY - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 138 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[23:09:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid