[00:00:21] so they are just sitting there, not blocked
[00:00:23] it does seem like it's leaving threads hanging around. I don't see active things running in javamelody or in show-queue, but lots of threads in activeThreads in melody and httpmaxtimes going up
[00:01:11] yeah, nothing blocked
[00:01:14] https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvNC8tLWpzdGFjay0xOS0wNS0wMy0yMy0zOC00OS5kdW1wLS0wLTAtNTA=
[00:01:31] ^ is a thread dump I took a few minutes before restart
[00:03:21] according to Google, the TIMED_WAITING threads are normal threads in the pool
[00:03:38] most of them seem to be those... maybe it's just letting the thread pool grow on purpose or something
[00:05:45] it's possible, although it has coincided with some lag in the UI and increased httpmaxtimes. It's been less severe since the move to G1GC, but I didn't want to let it sit on a Friday.
[00:06:12] that's reasonable
[00:06:20] (PS1) CRusnov: profile::netbox: Move ganeti sync config to /etc/netbox [puppet] - https://gerrit.wikimedia.org/r/508067
[00:06:48] anyway, I think I will actually now back away from refreshing graphs for a few hours at least :)
[00:07:03] have a good weekend
[00:07:24] you too, feel free to ping me if you need any hands on this
[00:07:44] will do. appreciated :)
[00:07:57] :)
[00:28:10] I think the higher thread count is related to the icmp/errors on https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=20&fullscreen&orgId=1&var-server=cobalt&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc (the times kind of match)
[00:28:35] threads just went up.
[00:31:24] actually, the times do match.
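The fastthread report linked above does this kind of triage automatically. As a rough illustration (not how fastthread works, and the sample dump below is made up and abbreviated), counting the `java.lang.Thread.State` lines in a raw jstack dump is enough to tell idle pool workers (TIMED_WAITING/WAITING) apart from contended threads (BLOCKED):

```python
import re
from collections import Counter

def thread_state_counts(jstack_output: str) -> Counter:
    """Tally JVM thread states from jstack output.

    jstack prints one 'java.lang.Thread.State: <STATE>' line per thread,
    so counting them shows whether a growing thread count is mostly
    idle pool workers or threads stuck on a lock.
    """
    states = re.findall(r"java\.lang\.Thread\.State: (\w+)", jstack_output)
    return Counter(states)

# Hypothetical, heavily trimmed dump fragment for illustration only.
sample = """\
"HTTP-72" #72 prio=5 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)
"HTTP-73" #73 prio=5 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)
"SSH-9" #9 prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"""
print(thread_state_counts(sample))  # Counter({'TIMED_WAITING': 2, 'BLOCKED': 1})
```

A dump dominated by TIMED_WAITING, as described above, is consistent with the executor simply keeping spare workers around rather than anything being stuck.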
[00:31:39] at least threads went up at the same time as the icmp/error a few mins ago
[00:38:38] Operations, Gerrit, Release-Engineering-Team, serviceops: cobalt is experiencing frequent icmp/errs causing high threads in gerrit - https://phabricator.wikimedia.org/T222498 (Paladox)
[00:45:08] Operations, Gerrit, Release-Engineering-Team, serviceops: cobalt is experiencing frequent icmp/errs causing high threads in gerrit - https://phabricator.wikimedia.org/T222498 (Paladox) p: Triage→High
[01:49:49] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:50:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:52:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:52:27] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:21:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:29:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0]
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[04:10:53] PROBLEM - puppet last run on cloudvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:42:45] RECOVERY - puppet last run on cloudvirt1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:06:21] PROBLEM - Device not healthy -SMART- on db1069 is CRITICAL: cluster=mysql device=megaraid,2 instance=db1069:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad+prometheus/ops
[05:50:51] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:17:00] Operations, DBA: SMART alerts on db1069 - https://phabricator.wikimedia.org/T222507 (jijiki)
[06:17:25] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:18:00] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1069 is CRITICAL: cluster=mysql device=megaraid,2 instance=db1069:9100 job=node site=eqiad Effie Mouzeli Smart error - T222507 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad+prometheus/ops
[06:29:07] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:30:41] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:38:53] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync
[06:44:57] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync
[08:15:33] Operations, Gerrit, Release-Engineering-Team, serviceops: cobalt is experiencing frequent icmp/errs causing high threads in gerrit - https://phabricator.wikimedia.org/T222498 (hashar) Open→Invalid Essentially it is the other way around ;) The TCP errors are due to client connections being...
[08:38:01] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 346 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Netbox
[08:51:35] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[08:52:55] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync
[08:58:59] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync
[10:34:25] (PS1) ArielGlenn: reduce sleep time between wikis for adds-changes dumps [dumps] - https://gerrit.wikimedia.org/r/508070
[10:36:08] (CR) ArielGlenn: [C: +2] convert exception error strings to utf8, thanks a lot python3 [dumps] - https://gerrit.wikimedia.org/r/507977 (owner: ArielGlenn)
[10:36:19] (CR) ArielGlenn: [C: +2] make page ranges for stubs be ints [dumps] - https://gerrit.wikimedia.org/r/507976 (owner: ArielGlenn)
[10:37:05] (CR) ArielGlenn: [C: +2] reduce sleep time between wikis for adds-changes dumps [dumps] - https://gerrit.wikimedia.org/r/508070 (owner: ArielGlenn)
[10:38:24] !log ariel@deploy1001 Started deploy [dumps/dumps@26b52ef]: misc small fixes, reduce sleep time for incr wikis
[10:38:27] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:33] !log ariel@deploy1001 Finished deploy [dumps/dumps@26b52ef]: misc small fixes, reduce sleep time for incr wikis (duration: 00m 09s)
[10:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:35] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[13:37:05] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:07:41] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:15:37] (PS1) Volans: Drop support for Python 3.4 [software/cumin] - https://gerrit.wikimedia.org/r/508078
[14:15:39] (PS1) Volans: tox: refactor configuration [software/cumin] - https://gerrit.wikimedia.org/r/508079
[14:15:41] (PS1) Volans: flake8: enforce import order and adopt W504 [software/cumin] - https://gerrit.wikimedia.org/r/508080
[14:15:43] (PS1) Volans: documentation: fix typo [software/cumin] - https://gerrit.wikimedia.org/r/508081
[14:16:17] (CR) jerkins-bot: [V: -1] flake8: enforce import order and adopt W504 [software/cumin] - https://gerrit.wikimedia.org/r/508080 (owner: Volans)
[14:16:19] (CR) jerkins-bot: [V: -1] tox: refactor configuration [software/cumin] - https://gerrit.wikimedia.org/r/508079 (owner: Volans)
[14:16:21] (CR) jerkins-bot: [V: -1] documentation: fix typo [software/cumin] - https://gerrit.wikimedia.org/r/508081 (owner: Volans)
[14:25:10] (PS2) Volans: tox: refactor configuration [software/cumin] - https://gerrit.wikimedia.org/r/508079
[14:25:12] (PS2) Volans: flake8: enforce import order and adopt W504 [software/cumin] - https://gerrit.wikimedia.org/r/508080
[14:25:14] (PS2) Volans: documentation: fix
typo [software/cumin] - https://gerrit.wikimedia.org/r/508081
[14:25:39] (CR) jerkins-bot: [V: -1] documentation: fix typo [software/cumin] - https://gerrit.wikimedia.org/r/508081 (owner: Volans)
[14:25:41] (CR) jerkins-bot: [V: -1] flake8: enforce import order and adopt W504 [software/cumin] - https://gerrit.wikimedia.org/r/508080 (owner: Volans)
[14:25:45] (CR) jerkins-bot: [V: -1] tox: refactor configuration [software/cumin] - https://gerrit.wikimedia.org/r/508079 (owner: Volans)
[14:33:03] (CR) Volans: "Blocked by https://phabricator.wikimedia.org/T222512" [software/cumin] - https://gerrit.wikimedia.org/r/508079 (owner: Volans)
[14:34:15] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:47:37] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:11:25] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[15:12:01] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[15:14:07] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[15:17:51] PROBLEM - puppet last run on mc1029 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[15:24:35] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[15:25:11] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[15:44:21] RECOVERY - puppet last run on mc1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:05:19] (PS1) Ammarpad: Enable Page Previews as default on hewikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017)
[16:58:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:04:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[22:17:30] (PS1) Reedy: Add flow, lsearch and sitelist to xml/index.html [mediawiki-config] - https://gerrit.wikimedia.org/r/508100
[22:18:04] (CR) Reedy: [C: +2] Add flow, lsearch and sitelist to xml/index.html [mediawiki-config] - https://gerrit.wikimedia.org/r/508100 (owner: Reedy)
[22:19:06] (Merged) jenkins-bot: Add flow, lsearch and sitelist to xml/index.html [mediawiki-config] - https://gerrit.wikimedia.org/r/508100 (owner: Reedy)
[22:19:20] (CR) jenkins-bot: Add flow, lsearch and sitelist to xml/index.html [mediawiki-config] -
https://gerrit.wikimedia.org/r/508100 (owner: Reedy)
[22:20:42] !log reedy@deploy1001 Synchronized docroot/mediawiki/xml/index.html: Add extra xml namespace links (duration: 01m 06s)
[22:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:33] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:26:37] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:26:37] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:26:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:26:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:26:59] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:27:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:27:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:27:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:27:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[23:27:51] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[23:27:57] that doesn't sound good
[23:28:16] what's up
[23:28:23] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[23:28:25] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[23:28:35] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:29:03] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[23:29:49] chaomodus, possibly the same issue again?
[23:29:55] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[23:30:00] user report at https://phabricator.wikimedia.org/T222418#5157744
[23:30:03] yeah, looks the same
[23:30:09] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:30:12] same symptoms, anyway
[23:30:24] that mailbox lag thing
[23:32:33] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:33:02] https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-3h&to=now
[23:33:09] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:33:11] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:33:12] Operations, Wikimedia-General-or-Unknown, User-DannyS712: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (Samwilson)
[23:33:13] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:33:29] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:33:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:33:35] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:33:39] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (Krenair)
[23:33:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:33:47] O.o Krenair https://en.wikipedia.org/w/index.php?title=Draft:Lassana_N%27Diaye&diff=895537723&oldid=895536904&diffmode=source
[23:33:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:33:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:33:56] transient
[23:34:04] substitution is broken
[23:34:05] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:34:13] backends look like they recovered
[23:34:18] err, the graphs I mean.
[23:34:21] ToBeFree, isn't there a way for users to trigger that behaviour?
[23:34:32] possibly, but the user is asking for help and has no idea
[23:34:45] this may be a distraction from the issue at hand
[23:34:51] I suggest putting a task in
[23:35:09] I thought the timing might explain a connection. hm
[23:35:20] I got 500 errors on userpages at the same time
[23:35:49] ah well, it's likely fixed now; I just thought there was an explanation for this as a known issue or something like that
[23:35:54] * ToBeFree shrugs
[23:36:23] ToBeFree: It is caused by an unclosed HTML comment.
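The diagnosis above (an unclosed HTML comment in the wikitext breaking substitution, since the comment swallows everything after it) can be sanity-checked mechanically. A minimal sketch of such a check, as a rough balance count rather than MediaWiki's actual parser, and with the function name being hypothetical:

```python
def unclosed_html_comments(wikitext: str) -> int:
    """Count '<!--' openers that never get a matching '-->' closer.

    An unclosed comment hides the rest of the page from the parser,
    which is how a single bad edit can break template substitution.
    This is a crude count, not real wikitext parsing.
    """
    opens = wikitext.count("<!--")
    closes = wikitext.count("-->")
    return max(opens - closes, 0)

assert unclosed_html_comments("text <!-- a note --> more text") == 0
assert unclosed_html_comments("text <!-- oops, never closed") == 1
```

A check along these lines could flag the problem before an editor has to bisect a diff like the one linked above.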
[23:36:46] https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-48h&to=now
[23:36:57] does not look as bad as when this happened two days ago
[23:37:33] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[23:37:45] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[23:37:55] gaah
[23:38:09] JJMC89: thank you so much, that would have taken a while for me to debug
[23:38:22] I was about to re-do the edit
[23:38:55] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[23:39:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[23:39:31] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[23:44:33] chaomodus, I notice the Varnish backend number of objects graph in particular - servers just suddenly dropping to 0?
[23:44:36] are they restarting?
[23:45:05] idk, I didn't restart them, but it's possible
[23:47:33] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[23:58:56] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712, Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (Krenair) possible continuation of https://wikitech.wikimedia.org/wiki/Incident_documentation/20190503-varnish