[00:00:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1403.eqiad.wmnet with reason: REIMAGE
[00:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:12] !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[00:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:33] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:08:09] !log T267927 Reload of `wdqs2003` complete
[00:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:59] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, relocating_shards: 0, status: green, number_of_in_flight_fetch: 0, timed_out: False, delayed_unassigned_shards: 0, active_shards_percent_as_number: 100.0, number_of_data_nodes: 6, active_primary_shards: 895, number_of_nodes: 6, task_max_waiting_in_queue_millis:
[00:08:59] ards: 0, active_shards: 1793, initializing_shards: 0, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:09:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1307.eqiad.wmnet with reason: REIMAGE
[00:09:11] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[00:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1307.eqiad.wmnet with reason: REIMAGE
[00:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:34] PROBLEM - puppet last run on wdqs2008 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:14:30] PROBLEM - puppet last run on wdqs2003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:14:54] !log T267927 `sudo run-puppet-agent` and `sudo pool` on `wdqs2003`
[00:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:02] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[00:16:30] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1403.eqiad.wmnet'] ` and were **ALL** s...
[00:16:59] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1402.eqiad.wmnet'] ` and were **ALL** s...
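The recurring "ElasticSearch health check for shards on 9200" flaps above come from a probe that fetches http://localhost:9200/_cluster/health and alerts when the request does not answer within its 4-second read timeout. A minimal manual equivalent, run on the affected cloudelastic host, is sketched below; it assumes only that curl is available there, everything else (endpoint, timeout, expected fields) is taken from the log lines themselves.

  # query the same cluster-health endpoint the check polls, with the same 4s limit
  curl --silent --max-time 4 'http://localhost:9200/_cluster/health?pretty'
  # a healthy cluster reports status: green and active_shards_percent_as_number: 100.0,
  # matching the RECOVERY messages logged once the flap clears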
[00:17:48] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 108 probes of 622 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:18:47] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1402.eqiad.wmnet
[00:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:56] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1403.eqiad.wmnet
[00:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:56] RECOVERY - puppet last run on wdqs2003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:20:31] (PS2) Ryan Kemper: wdqs: int can't take in float as string [cookbooks] - https://gerrit.wikimedia.org/r/680095 (https://phabricator.wikimedia.org/T280108)
[00:22:51] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1403.eqiad.wmnet
[00:22:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 51 probes of 622 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:38] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1402.eqiad.wmnet
[00:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:11] RECOVERY - puppet last run on wdqs2008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:32:15] (CR) Ryan Kemper: wdqs: int can't take in float as string (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/680095 (https://phabricator.wikimedia.org/T280108) (owner: Ryan Kemper)
[00:48:48] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1307.eqiad.wmnet
[00:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:48] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1307.eqiad.wmnet'] ` and were **ALL** s...
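The `conftool action : set/pooled=no|yes` SAL entries above record mw1402/mw1403 (and later mw1307) being depooled for their reimages and repooled afterwards. A sketch of the corresponding confctl invocation as typically issued from the cumin host follows; the selector and value come straight from the log, while the sudo wrapper and exact quoting are assumptions.

  sudo confctl select 'name=mw1403.eqiad.wmnet' set/pooled=no    # depool before the reimage
  sudo confctl select 'name=mw1403.eqiad.wmnet' set/pooled=yes   # repool once the host is healthy again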
[00:53:20] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1307.eqiad.wmnet
[00:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:15:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:15:44] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[03:17:12] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:44:00] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[03:50:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_netflow_failure_flags.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:53:26] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[04:02:56] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[04:08:40] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:10:06] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[04:10:52] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:28:56] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:31:10] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_shards: 1793, initializing_shards: 0, delayed_unassigned_shards: 0, unassigned_shards: 0, relocating_shards: 0, active_primary_shards: 895, task_max_waiting_in_queue_millis: 0, cluster_name: cloudelastic-chi-eqiad, status: green, number_of_data_nodes: 6, number_of_pending_tasks: 0, number_of
[04:31:10] 0, number_of_nodes: 6, active_shards_percent_as_number: 100.0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:11:58] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:14:14] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: task_max_waiting_in_queue_millis: 0, active_primary_shards: 895, active_shards: 1793, number_of_in_flight_fetch: 0, number_of_pending_tasks: 0, active_shards_percent_as_number: 100.0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_nodes: 6, number_of_data_nodes: 6, initializing_shard
[05:14:14] e: cloudelastic-chi-eqiad, relocating_shards: 0, status: green, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:27:18] SRE, Wikimedia-SVG-rendering: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (NFSL2001)
[05:27:58] SRE, Wikimedia-SVG-rendering: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (NFSL2001)
[05:40:04] SRE, Wikimedia-Mailing-lists: Create a mailing list for ptwikinews - https://phabricator.wikimedia.org/T280408 (Ladsgroup) Hi, if you can wait for two to three weeks, we will get mailman3 up and running soon and you can enjoy a much more modern system (see https://lists-next.wikimedia.org). This is by no...
[06:38:38] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:40:54] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: number_of_data_nodes: 6, number_of_pending_tasks: 0, status: green, unassigned_shards: 0, cluster_name: cloudelastic-chi-eqiad, active_shards_percent_as_number: 100.0, number_of_nodes: 6, task_max_waiting_in_queue_millis: 0, active_primary_shards: 895, number_of_in_flight_fetch: 0, relocating_shard
[06:40:54] ssigned_shards: 0, active_shards: 1793, initializing_shards: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:19:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:19:40] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:24:26] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_primary_shards: 895, relocating_shards: 0, cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, timed_out: False, number_of_in_flight_fetch: 0, number_of_nodes: 6, active_shards_percent_as_number: 100.0, initializing_shards: 0, unassigned_shards: 0,
[07:24:26] ctive_shards: 1793, number_of_data_nodes: 6, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:33:00] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=misc file=puppet_agent.prom instance=otrs1001 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[08:05:34] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:07:50] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: delayed_unassigned_shards: 0, active_shards: 1793, task_max_waiting_in_queue_millis: 0, cluster_name: cloudelastic-chi-eqiad, number_of_nodes: 6, timed_out: False, relocating_shards: 0, initializing_shards: 0, number_of_pending_tasks: 0, unassigned_shards: 0, active_shards_percent_as_number: 100.0,
[08:07:50] umber_of_data_nodes: 6, number_of_in_flight_fetch: 0, active_primary_shards: 895 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:09:20] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[08:14:08] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[09:13:23] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (Ladsgroup) So after several changes in puppetmaster of mailman in the cloud, it works now: https://polymorphic.lists.wmcloud.org/hyperkitty/hyperkit...
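The "Check systemd state" alerts on an-launcher1002 in this stretch flip between "running: The system is fully operational" and "degraded: The following units failed: monitor_refine_netflow_failure_flags.service", i.e. they track overall systemd health on the host. A short sketch of inspecting that state by hand with stock systemd commands, nothing Wikimedia-specific assumed:

  systemctl is-system-running          # prints "degraded" while any unit is in the failed state
  systemctl list-units --state=failed  # lists the failed unit(s), e.g. monitor_refine_netflow_failure_flags.service
  sudo systemctl reset-failed          # clears the failed markers once the underlying job has been fixed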
[09:39:16] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:41:34] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:07:46] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:24:57] SRE, Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (Schlurcher) >> From my talk page a suggestion was to check maxlag. I'm checking maxlag, and at the time I ultimately shut down the bot, maxlag event...
[10:27:29] SRE, MediaWiki-General, Browser-Support-Apple-Safari: File:Chessboard480.svg not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (Daimona)
[10:37:04] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:41:52] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: status: green, number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, initializing_shards: 0, active_shards: 1793, number_of_in_flight_fetch: 0, number_of_data_nodes: 6, unassigned_shards: 0, delayed_unassigned_shards: 0, cluster_name: cloudelastic-chi-eqiad, active_primary_shards: 895, r
[10:41:52] 0, timed_out: False, number_of_nodes: 6, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[12:03:28] SRE, Wikimedia-Mailing-lists: Create a mailing list for ptwikinews - https://phabricator.wikimedia.org/T280408 (Edu) @Ladsgroup I'll wait. Thanks.
[12:10:12] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:10] RECOVERY - Host elastic2043 is UP: PING WARNING - Packet loss = 66%, RTA = 31.89 ms
[12:18:16] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[12:20:34] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: number_of_pending_tasks: 0, timed_out: False, status: green, task_max_waiting_in_queue_millis: 0, initializing_shards: 0, active_shards: 1793, active_shards_percent_as_number: 100.0, number_of_nodes: 6, number_of_data_nodes: 6, relocating_shards: 0, unassigned_shards: 0, cluster_name: cloudelastic-
[12:20:34] _primary_shards: 895, number_of_in_flight_fetch: 0, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:10:59] (CR) Volans: "I might be missing context here but it seems to me that it would be much simpler to just write a cookbook (or modify the existing downtime" [puppet] - https://gerrit.wikimedia.org/r/680376 (owner: David Caro)
[14:36:58] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:39:16] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: timed_out: False, unassigned_shards: 0, active_shards: 1793, relocating_shards: 0, active_primary_shards: 895, active_shards_percent_as_number: 100.0, cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, status: green, number_
[14:39:16] ializing_shards: 0, number_of_data_nodes: 6, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:06:34] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:08:48] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:20:16] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.7937 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:21:44] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[15:22:26] PROBLEM - Mediawiki CirrusSearch pool counter rejections rate on alert1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Pool_Counter_rejections_%28search_is_currently_too_busy%29 https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1
[15:27:12] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1060-production-search-eqiad on elastic1060 is CRITICAL: 288.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-eqiad&var-instance=elastic1060&panelId=37
[15:32:22] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:36:07] (PS1) Daimona Eaytoy: Stop setting $wgAbuseFilterParserClass [mediawiki-config] - https://gerrit.wikimedia.org/r/680753 (https://phabricator.wikimedia.org/T239990)
[15:38:40] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[15:41:46] RECOVERY - Mediawiki CirrusSearch pool counter rejections rate on alert1001 is OK: OK: Less than 1.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Pool_Counter_rejections_%28search_is_currently_too_busy%29 https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1
[16:03:50] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:06:08] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: timed_out: False, number_of_pending_tasks: 0, cluster_name: cloudelastic-chi-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, number_of_data_nodes: 6, number_of_in_flight_fetch: 0, initializing_shards: 0, number_of_nodes: 6, status: green, task_max_waiting_in_queue_millis: 0, ac
[16:06:08] , delayed_unassigned_shards: 0, unassigned_shards: 0, active_primary_shards: 895 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:06:44] Hi, I want to add my tool to CI on Gerrit. I know I have to add config in zuul/layout.yaml. In the past, I only added tox-docker for my Python app. This time I am using React, which needs `yarn test` or `npm run test`. So what template name should I use?
[16:08:28] RECOVERY - Memory correctable errors -EDAC- on thumbor2001 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor2001&var-datasource=codfw+prometheus/ops
[16:16:22] !log cleaning SuccuBot's watchlist in wikidatawiki
[16:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:16] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1060-production-search-eqiad on elastic1060 is OK: (C)100 gt (W)80 gt 61.02 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-eqiad&var-instance=elastic1060&panelId=37
[16:42:20] SRE, Wikimedia-Mailing-lists: Create a mailing list for ptwikinews - https://phabricator.wikimedia.org/T280408 (Ladsgroup) Thanks.
[16:49:32] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 69 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:55:46] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:55:58] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 51 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:58:00] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:03:46] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:06:04] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: status: green, active_shards_percent_as_number: 100.0, number_of_data_nodes: 6, active_shards: 1793, active_primary_shards: 895, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, timed_out: False, number_of_nodes: 6, number_of_pending_tasks: 0, relocating_shards: 0, number
[18:06:04] ch: 0, cluster_name: cloudelastic-chi-eqiad, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:49:10] (CR) Tacsipacsi: "Shouldn’t this patchset be tagged with T273317 and T275322? gerritbot doesn’t follow `Depends-On`, so there are no notifications on Phabri" [mediawiki-config] - https://gerrit.wikimedia.org/r/679938 (owner: Gergő Tisza)
[20:14:14] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:19:02] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: status: green, cluster_name: cloudelastic-chi-eqiad, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, active_shards: 1793, unassigned_shards: 0, active_primary_shards: 895, delayed_unassigned_shards: 0, initializing_shards: 0, active_shards_percent_as_number: 100.0, number_of_in_flight_fe
[20:19:02] _nodes: 6, number_of_pending_tasks: 0, number_of_data_nodes: 6, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:37:28] (CR) Jforrester: [C: +1] "Ha, we could have removed the IS setting a while ago. Oops." [mediawiki-config] - https://gerrit.wikimedia.org/r/680753 (https://phabricator.wikimedia.org/T239990) (owner: Daimona Eaytoy)
[20:39:26] (CR) Gergő Tisza: "> Shouldn’t this patchset be tagged with T273317 and T275322? gerritbot doesn’t follow `Depends-On`, so there are no notifications on Phab" [mediawiki-config] - https://gerrit.wikimedia.org/r/679938 (owner: Gergő Tisza)
[21:19:06] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 307.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37
[21:56:56] (CR) Tacsipacsi: "Yes, there are quite a number of FlaggedRevs tickets, probably a number of them having the same root cause. I’m also lost, but that’s what" [mediawiki-config] - https://gerrit.wikimedia.org/r/679938 (owner: Gergő Tisza)
[22:08:58] (PS7) Ahmon Dancy: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/680405
[22:15:42] (PS8) Ahmon Dancy: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/680405
[22:19:10] (PS9) Ahmon Dancy: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/680405
[22:43:18] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:53:02] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_primary_shards: 895, task_max_waiting_in_queue_millis: 0, active_shards: 1793, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, number_of_nodes: 6, relocating_shards: 0, timed_out: False, initializing_shards: 0, cluster_name: cloudelastic-chi-eqiad, status: green, number_of_pendin
[22:53:02] r_of_data_nodes: 6, active_shards_percent_as_number: 100.0, unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration