[00:59:32] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:01:24] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:30:16] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[03:35:54] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 112.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[04:00:06] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 131.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[04:46:38] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[04:59:40] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 50.85 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[06:40:48] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[07:26:50] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[08:13:44] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 114.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[08:24:24] these alarms are not incredibly problematic, but they indicate that a couple of nodes are spending a ton of time in old GC runs --^
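
The "Rate of JVM GC Old generation-s runs" checks above fire when old-generation collections on a node happen too often (critical above 100, warning above 80, per the recovery messages). For a quick manual spot check on an affected host, something like the sketch below could be used; it assumes the JDK's jstat tool is available on the host, treats the FGC column as a rough proxy for old-generation collections, and makes no claim about the exact window behind the alert threshold, which is not shown in the log.

```python
#!/usr/bin/env python3
"""Rough manual spot check of old-generation GC activity on a JVM.

Sketch only: the production alert reads a Prometheus metric; this just samples
`jstat -gcutil` twice and extrapolates a collections-per-hour figure.
"""
import subprocess
import sys
import time


def full_gc_count(pid: int) -> int:
    """Return the cumulative full-GC count (FGC column) for the given JVM pid."""
    out = subprocess.check_output(["jstat", "-gcutil", str(pid)], text=True)
    header, values = out.strip().splitlines()[:2]
    cols = dict(zip(header.split(), values.split()))
    return int(float(cols["FGC"]))


def old_gc_rate_per_hour(pid: int, window_s: int = 300) -> float:
    """Sample FGC twice, window_s seconds apart, and extrapolate to runs/hour."""
    start = full_gc_count(pid)
    time.sleep(window_s)
    end = full_gc_count(pid)
    return (end - start) * 3600.0 / window_s


if __name__ == "__main__":
    pid = int(sys.argv[1])  # pid of the Elasticsearch JVM on the affected host
    rate = old_gc_rate_per_hour(pid)
    # The alert thresholds above are 100 (critical) / 80 (warning); the exact
    # time window they apply to is not shown in the log.
    print(f"old-generation GC runs/hour (extrapolated): {rate:.1f}")
```
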
[09:05:32] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[09:09:36] PROBLEM - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 3 days ago: Most recent backup 2020-04-02 08:53:40 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[09:44:10] PROBLEM - Check systemd state on cp3050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:53:28] RECOVERY - Check systemd state on cp3050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:55:32] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[11:06:28] (Abandoned) MarcoAurelio: offboard-user: Include new security subprojects [puppet] - https://gerrit.wikimedia.org/r/576440 (owner: MarcoAurelio)
[11:18:24] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 131.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[11:30:40] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 69.15 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[11:59:08] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[12:15:48] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[12:37:52] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:01:38] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22410 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
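
The "Ensure traffic_exporter binds on port 9322 and responds to HTTP requests" check above is essentially an HTTP probe with a 10-second timeout. A minimal sketch of an equivalent manual probe follows; the exact URL path the Icinga check fetches is not shown in the log, so fetching the root path here is an assumption.

```python
#!/usr/bin/env python3
"""Minimal manual version of the "traffic_exporter binds on port 9322" probe.

Sketch only: the real check's path and plugin options are not shown in the log.
"""
import sys
import urllib.error
import urllib.request


def probe(host: str, port: int = 9322, timeout: float = 10.0) -> int:
    """Return 0 (OK) if the exporter answers over HTTP, 2 (CRITICAL) otherwise."""
    url = f"http://{host}:{port}/"  # root path is an assumption
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            print(f"OK: HTTP {resp.getcode()} - {len(body)} bytes")
            return 0
    except (urllib.error.URLError, OSError) as exc:
        print(f"CRITICAL: {exc}")
        return 2


if __name__ == "__main__":
    # usage: probe_traffic_exporter.py <host fqdn>
    sys.exit(probe(sys.argv[1]))
```
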
[13:56:28] Operations, Mail: Wiki email not delievered to GMail - https://phabricator.wikimedia.org/T243937 (Reedy)
[13:56:48] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:06:47] Operations, Mail: Wiki email not delievered to GMail - https://phabricator.wikimedia.org/T243937 (Aklapper) @Huji: As there is [sometimes a backlog](https://grafana.wikimedia.org/d/nULM0E1Wk/mailman?orgId=1), did that message ever arrive later? Also, I guess you did check spam folders?
[14:07:48] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22407 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:13:32] Operations, Mail: Wiki email not delievered to GMail - https://phabricator.wikimedia.org/T243937 (Platonides) I don't think that graph is the right one, André. It may provide approximate data (both are emails), but I think list email is even sent from a completely different relay. Also, @Huji the issue...
[14:13:53] /26/101
[14:13:56] Operations, Mail: Wiki email not delivered to GMail - https://phabricator.wikimedia.org/T243937 (Platonides)
[14:16:37] Operations, Mail: Wiki email not delivered to GMail - https://phabricator.wikimedia.org/T243937 (Reedy) FWIW, @Matanya is reporting that another onwiki user isn't getting password reset emails to their gmail account (no bounce records)
[14:16:59] Operations, Mail: Wiki email not delivered to GMail - https://phabricator.wikimedia.org/T243937 (Aklapper) Thanks. If Yahoo is involved, then quite often Yahoo is the problem (there are quite some tasks about Yahoo Mail here).
[14:29:48] Operations, MediaWiki-Cache, Page Content Service, Product-Infrastructure-Team-Backlog, and 3 others: esams cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (CDanis)
[15:12:41] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/584913 (owner: 4nn1l2)
[15:34:58] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:36:58] (PS1) Andrew Bogott: Keystone: Drop in a backported fix to token_formatters.py [puppet] - https://gerrit.wikimedia.org/r/586105 (https://phabricator.wikimedia.org/T248635)
[15:44:29] (PS2) Andrew Bogott: Keystone: Drop in a backported fix to token_formatters.py [puppet] - https://gerrit.wikimedia.org/r/586105 (https://phabricator.wikimedia.org/T248635)
[15:46:20] (CR) Andrew Bogott: [C: +2] Keystone: Drop in a backported fix to token_formatters.py [puppet] - https://gerrit.wikimedia.org/r/586105 (https://phabricator.wikimedia.org/T248635) (owner: Andrew Bogott)
[16:02:40] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22420 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:14:57] (PS1) CDanis: update vhtcpd exporter to match reality [puppet] - https://gerrit.wikimedia.org/r/586111 (https://phabricator.wikimedia.org/T249346)
[16:19:42] (PS2) CDanis: update vhtcpd exporter to match reality [puppet] - https://gerrit.wikimedia.org/r/586111 (https://phabricator.wikimedia.org/T249346)
[16:20:07] (PS3) CDanis: update vhtcpd exporter to match reality [puppet] - https://gerrit.wikimedia.org/r/586111 (https://phabricator.wikimedia.org/T249346)
[16:25:36] Operations, Traffic, observability, Patch-For-Review: vhtcpd prometheus metrics broken; prometheus-vhtcpd-stats.py out-of-date with reality - https://phabricator.wikimedia.org/T249346 (CDanis) Updated version parsing your new output above: {P10893}
[16:26:48] Operations, Traffic, observability, Patch-For-Review: vhtcpd prometheus metrics broken; prometheus-vhtcpd-stats.py out-of-date with reality - https://phabricator.wikimedia.org/T249346 (CDanis) And parsing current output from `cp3052`: {P10895}
[16:29:00] (CR) CDanis: [C: +2] "I'm self-+2'ing because I want to start tracking the esams backlog ASAP." [puppet] - https://gerrit.wikimedia.org/r/586111 (https://phabricator.wikimedia.org/T249346) (owner: CDanis)
[16:29:18] (CR) CDanis: [C: +2] "> Patch Set 3: Code-Review+2" [puppet] - https://gerrit.wikimedia.org/r/586111 (https://phabricator.wikimedia.org/T249346) (owner: CDanis)
[16:36:33] Operations, Domains, Traffic, WMF-Legal: wikipedia.lol - https://phabricator.wikimedia.org/T88861 (Dzahn) Ok, fine with me. Thanks!
[16:44:10] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
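
The Debian mirror check above simply reports how stale /srv/mirrors/debian has become. A rough sketch of a freshness check of that shape follows; how the production check actually derives the age (for example from a Debian trace file) is not shown in the log, so file modification time is used here as an assumption, and the 14-hour threshold only mirrors the failing message above.

```python
#!/usr/bin/env python3
"""Sketch of a mirror-freshness check like the sodium alert above.

Assumption: freshness is judged from the modification time of a path inside
the mirror; the real check's data source and threshold are not shown in the log.
"""
import os
import sys
import time


def age_hours(path: str) -> float:
    """Hours since `path` was last modified."""
    return (time.time() - os.path.getmtime(path)) / 3600.0


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/srv/mirrors/debian"
    hours = age_hours(path)
    status = "CRITICAL" if hours > 14 else "OK"
    print(f"{status}: {path} is over {int(hours)} hours old.")
```
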
[16:45:46] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[16:46:58] Operations, Traffic, observability, Patch-For-Review: vhtcpd prometheus metrics broken; prometheus-vhtcpd-stats.py out-of-date with reality - https://phabricator.wikimedia.org/T249346 (CDanis) Open→Resolved a:CDanis https://grafana.wikimedia.org/d/wBCQKHjWz/vhtcpd?orgId=1&var-datasour...
[16:52:32] Operations, MediaWiki-Cache, Page Content Service, Product-Infrastructure-Team-Backlog, and 4 others: esams cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (CDanis) It isn't just esams that often has a backlog: looking at the past 10-20 min...
[17:17:44] Operations, netops: review fastnetmon thresholds after sensible flow table sizes rollout - https://phabricator.wikimedia.org/T249454 (CDanis)
[17:24:48] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 65.08 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[17:55:11] (PS1) Jhedden: openstack: update nova-placement healthcheck in codfwdev1 [puppet] - https://gerrit.wikimedia.org/r/586118 (https://phabricator.wikimedia.org/T249453)
[17:58:21] (CR) jerkins-bot: [V: -1] openstack: update nova-placement healthcheck in codfwdev1 [puppet] - https://gerrit.wikimedia.org/r/586118 (https://phabricator.wikimedia.org/T249453) (owner: Jhedden)
[18:03:05] (PS2) Jhedden: openstack: update nova-placement healthcheck in codfwdev1 [puppet] - https://gerrit.wikimedia.org/r/586118 (https://phabricator.wikimedia.org/T249453)
[18:04:07] (PS3) Jhedden: openstack: update nova-placement healthcheck in codfwdev1 [puppet] - https://gerrit.wikimedia.org/r/586118 (https://phabricator.wikimedia.org/T249453)
[18:34:51] Operations, MediaWiki-Cache, Page Content Service, Product-Infrastructure-Team-Backlog, and 3 others: esams cache_text cluster consistently backlogged on purge requests - https://phabricator.wikimedia.org/T249325 (bearND) Yes, I think it's more than just esams since the merged in task I created (...
[18:37:06] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[18:49:08] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[18:49:58] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 22409 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:14:38] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[19:16:28] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[19:19:38] Operations, serviceops, Continuous-Integration-Config, Regression, Wikimedia-Incident: operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801 (Krinkle)
[19:26:18] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[19:30:49] (PS1) Andrew Bogott: codf1dev db server: increase max connections by a lot [puppet] - https://gerrit.wikimedia.org/r/586135 (https://phabricator.wikimedia.org/T249453)
[19:34:22] (CR) Andrew Bogott: [C: +2] codf1dev db server: increase max connections by a lot [puppet] - https://gerrit.wikimedia.org/r/586135 (https://phabricator.wikimedia.org/T249453) (owner: Andrew Bogott)
[19:36:26] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 140.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[19:44:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[19:47:52] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:02:38] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:04:30] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:12:08] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:13:52] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:13:58] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:19:26] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:25:00] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:28:04] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:28:10] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[20:28:14] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 48.81 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[20:28:40] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:29:54] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:31:42] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:34:14] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 547 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:35:26] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:40:12] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 38 probes of 547 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:41:36] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:43:26] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:51:21] Someone’s just reported that there’s a significant delay in watchlist notification emails (they’re outlook/hotmail) - could it be linked to T243937
[20:51:21] T243937: Wiki email not delivered to GMail - https://phabricator.wikimedia.org/T243937
[20:54:30] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is OK: HTTP OK: HTTP/1.0 200 OK - 22337 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:58:39] RF1: we could send some test emails
[20:58:40] Operations, Mail: Wiki email not delivered to GMail - https://phabricator.wikimedia.org/T243937 (Krenair) I just tried to reset password and the emails appeared in gmail immediately.
[20:59:02] Krenair is already testing, it seems :P
[20:59:16] I just did a one-off password reset attempt
[20:59:17] * Krenair shrugs
[20:59:20] it looked fine to me
[21:00:13] try Special:UserMail me
[21:00:16] It’s the first report, left them a link to that task.
[21:00:40] weird that they claim outlook is slow as well
[21:01:02] it could be something on wmf side...
[21:01:13] or simply people that is unable to look at the spam folfer
[21:01:15] *folder
[21:01:28] Could be
[21:01:43] Emails from the mailing lists seem fine
[21:02:00] I think those use a different relay
[21:02:21] But I only get daily emails for notifications as I’d get one every few seconds otherwise
[21:02:55] I'm not even sure how I have notifications set up
[21:05:47] Platonides, did you get an email?
[21:05:59] I got a sender's copy
[21:06:11] nope
[21:06:18] * Platonides digs deeper
[21:06:38] it's there now
[21:06:46] 21:05 (0 minutes ago)
[21:06:54] fast enough for email
[21:07:03] LGTM
[21:07:56] 0 seconds of delay according to Received: headers
[21:08:54] so it's pretty good :P
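
The "0 seconds of delay according to Received: headers" check above can be reproduced by comparing the timestamps that each relay stamps into the Received: headers of the raw message. A small sketch follows, assuming the message has been saved to a local .eml file and that every hop includes a timezone offset in its date.

```python
#!/usr/bin/env python3
"""Estimate per-hop mail delay from Received: headers, as discussed above.

Sketch only: reads a raw message from a .eml file and prints the time spent
between consecutive relay hops (Received: headers are newest-first).
"""
import sys
from email import message_from_binary_file
from email.utils import parsedate_to_datetime


def hop_delays(path: str):
    with open(path, "rb") as fh:
        msg = message_from_binary_file(fh)
    # The date is the part after the last ';' in each Received: header.
    # Assumes every hop's date carries a timezone offset.
    stamps = [parsedate_to_datetime(h.rsplit(";", 1)[1].strip())
              for h in msg.get_all("Received", []) if ";" in h]
    stamps.reverse()  # oldest hop first
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for i, secs in enumerate(hop_delays(sys.argv[1]), start=1):
        print(f"hop {i}: {secs:.0f}s")
```
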
[21:25:24] PROBLEM - Host mw2323 is DOWN: PING CRITICAL - Packet loss = 100%
[21:28:00] RECOVERY - Host mw2323 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[21:53:59] (PS1) Riley: Update ti.wiki logo from English version to version located at [[File:Wikipedia-logo-v2-ti.svg]] [mediawiki-config] - https://gerrit.wikimedia.org/r/586149
[21:54:01] (PS1) Legoktm: planet: Add new techblog.wikimedia.org feed [puppet] - https://gerrit.wikimedia.org/r/586147
[21:54:03] (PS1) Legoktm: planet: Remove dead blog.wikimedia.org feeds [puppet] - https://gerrit.wikimedia.org/r/586148
[21:54:05] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - https://gerrit.wikimedia.org/r/586149 (owner: Riley)
[21:54:19] thats excessive
[21:54:44] we can take back the welcome if you'd like ;)
[21:55:11] (CR) jerkins-bot: [V: -1] planet: Remove dead blog.wikimedia.org feeds [puppet] - https://gerrit.wikimedia.org/r/586148 (owner: Legoktm)
[21:55:50] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[21:56:19] (PS2) Legoktm: planet: Remove dead blog.wikimedia.org feeds [puppet] - https://gerrit.wikimedia.org/r/586148
[21:56:44] riley ohh, you want to upload an image?
[21:57:17] paladox: Yes, for two wikis
[21:57:19] legoktm: LOL
[21:57:47] Ok, you'll need to use git on the commandline (file upload support will be coming in gerrit 3.2).
[21:57:52] riley do you have git installed?
[21:58:24] (Abandoned) Riley: Update ti.wiki logo from English version to version located at [[File:Wikipedia-logo-v2-ti.svg]] [mediawiki-config] - https://gerrit.wikimedia.org/r/586149 (owner: Riley)
[21:58:38] I do, just need to update SSH
[21:59:26] You add your ssh key to https://gerrit.wikimedia.org/r/#/settings/ssh-keys
[21:59:49] riley have you created ~/.gitconfig too?
[22:03:22] Yup
[22:03:25] And now done for ssh key
[22:37:34] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 109.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[22:46:00] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 16.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[22:46:14] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[22:47:58] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[22:56:48] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[22:58:34] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[23:03:54] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[23:06:02] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[23:11:18] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[23:11:34] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[23:14:44] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:20:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:22:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:25:42] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1075 is OK: HTTP OK: HTTP/1.0 200 OK - 22336 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:57:32] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37