[00:00:21] so they are just sitting there, not blocked
[00:00:23] it does seem like it's leaving threads hanging around. I don't see active things running in javamelody or in show-queue, but lots of threads in activeThreads in melody and httpmaxtimes going up
[00:01:11] yeah, nothing blocked
[00:01:14] https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvNC8tLWpzdGFjay0xOS0wNS0wMy0yMy0zOC00OS5kdW1wLS0wLTAtNTA=
[00:01:31] ^ is a thread dump I took a few minutes before restart
[00:03:21] according to Google, the TIMED_WAITING threads are normal threads in the pool
[00:03:38] most of them seem to be those... maybe it's just letting the thread pool grow on purpose or something
[00:05:45] it's possible, although it has coincided with some lag in the UI and increased httpmaxtimes. It's been less severe since the move to G1GC, but I didn't want to let it sit on a Friday.
[00:06:12] that's reasonable
[00:06:20] (PS1) CRusnov: profile::netbox: Move ganeti sync config to /etc/netbox [puppet] - https://gerrit.wikimedia.org/r/508067
[00:06:48] anyway, I think I will actually now back away from refreshing graphs for a few hours at least :)
[00:07:03] have a good weekend
[00:07:24] you too, feel free to ping me if you need any hands on this
[00:07:44] will do. appreciated :)
[00:07:57] :)
[00:28:10] I think the higher thread count is related to the icmp/errors on https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=20&fullscreen&orgId=1&var-server=cobalt&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc (the times kind of match)
[00:28:35] threads just went up.
[00:31:24] actually, the times do match.
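The fastthread report linked above does this kind of triage automatically. As a rough illustration (not how fastthread works, and the sample dump below is made up and abbreviated), counting the `java.lang.Thread.State` lines in a raw jstack dump is enough to tell idle pool workers (TIMED_WAITING/WAITING) apart from contended threads (BLOCKED):

```python
import re
from collections import Counter

def thread_state_counts(jstack_output: str) -> Counter:
    """Tally JVM thread states from jstack output.

    jstack prints one 'java.lang.Thread.State: <STATE>' line per thread,
    so counting them shows whether a growing thread count is mostly
    idle pool workers or threads stuck on a lock.
    """
    states = re.findall(r"java\.lang\.Thread\.State: (\w+)", jstack_output)
    return Counter(states)

# Hypothetical, heavily trimmed dump fragment for illustration only.
sample = """\
"HTTP-72" #72 prio=5 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)
"HTTP-73" #73 prio=5 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)
"SSH-9" #9 prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"""
print(thread_state_counts(sample))  # Counter({'TIMED_WAITING': 2, 'BLOCKED': 1})
```

A dump dominated by TIMED_WAITING, as described above, is consistent with the executor simply keeping spare workers around rather than anything being stuck.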
[00:31:39] at least threads went up at the same time as the icmp/error a few mins ago
[00:38:38] Operations, Gerrit, Release-Engineering-Team, serviceops: cobalt is experiencing frequent icmp/errs causing high threads in gerrit - https://phabricator.wikimedia.org/T222498 (Paladox)
[00:45:08] Operations, Gerrit, Release-Engineering-Team, serviceops: cobalt is experiencing frequent icmp/errs causing high threads in gerrit - https://phabricator.wikimedia.org/T222498 (Paladox) p: Triage→High
[01:49:49] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:50:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:52:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:52:27] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:21:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:29:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0]
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[04:10:53] PROBLEM - puppet last run on cloudvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:42:45] RECOVERY - puppet last run on cloudvirt1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:06:21] PROBLEM - Device not healthy -SMART- on db1069 is CRITICAL: cluster=mysql device=megaraid,2 instance=db1069:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad+prometheus/ops
[05:50:51] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:17:00] Operations, DBA: SMART alerts on db1069 - https://phabricator.wikimedia.org/T222507 (jijiki)
[06:17:25] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:18:00] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1069 is CRITICAL: cluster=mysql device=megaraid,2 instance=db1069:9100 job=node site=eqiad Effie Mouzeli Smart error - T222507 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad+prometheus/ops
[06:29:07] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:30:41] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:38:53] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync
[06:44:57] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync
[08:15:33] Operations, Gerrit, Release-Engineering-Team, serviceops: cobalt is experiencing frequent icmp/errs causing high threads in gerrit - https://phabricator.wikimedia.org/T222498 (hashar) Open→Invalid Essentially it is the other way around ;) The TCP errors are due to client connections being...
[08:38:01] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 346 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Netbox
[08:51:35] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[08:52:55] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync
[08:58:59] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync
[10:34:25] (PS1) ArielGlenn: reduce sleep time between wikis for adds-changes dumps [dumps] - https://gerrit.wikimedia.org/r/508070
[10:36:08] (CR) ArielGlenn: [C: +2] convert exception error strings to utf8, thanks a lot python3 [dumps] - https://gerrit.wikimedia.org/r/507977 (owner: ArielGlenn)
[10:36:19] (CR) ArielGlenn: [C: +2] make page ranges for stubs be ints [dumps] - https://gerrit.wikimedia.org/r/507976 (owner: ArielGlenn)
[10:37:05] (CR) ArielGlenn: [C: +2] reduce sleep time between wikis for adds-changes dumps [dumps] - https://gerrit.wikimedia.org/r/508070 (owner: ArielGlenn)
[10:38:24] !log ariel@deploy1001 Started deploy [dumps/dumps@26b52ef]: misc small fixes, reduce sleep time for incr wikis
[10:38:27] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:33] !log ariel@deploy1001 Finished deploy [dumps/dumps@26b52ef]: misc small fixes, reduce sleep time for incr wikis (duration: 00m 09s)
[10:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:35] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[13:37:05] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:07:41] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:15:37] (PS1) Volans: Drop support for Python 3.4 [software/cumin] - https://gerrit.wikimedia.org/r/508078
[14:15:39] (PS1) Volans: tox: refactor configuration [software/cumin] - https://gerrit.wikimedia.org/r/508079
[14:15:41] (PS1) Volans: flake8: enforce import order and adopt W504 [software/cumin] - https://gerrit.wikimedia.org/r/508080
[14:15:43] (PS1) Volans: documentation: fix typo [software/cumin] - https://gerrit.wikimedia.org/r/508081
[14:16:17] (CR) jerkins-bot: [V: -1] flake8: enforce import order and adopt W504 [software/cumin] - https://gerrit.wikimedia.org/r/508080 (owner: Volans)
[14:16:19] (CR) jerkins-bot: [V: -1] tox: refactor configuration [software/cumin] - https://gerrit.wikimedia.org/r/508079 (owner: Volans)
[14:16:21] (CR) jerkins-bot: [V: -1] documentation: fix typo [software/cumin] - https://gerrit.wikimedia.org/r/508081 (owner: Volans)
[14:25:10] (PS2) Volans: tox: refactor configuration [software/cumin] - https://gerrit.wikimedia.org/r/508079
[14:25:12] (PS2) Volans: flake8: enforce import order and adopt W504 [software/cumin] - https://gerrit.wikimedia.org/r/508080
[14:25:14] (PS2) Volans: documentation: fix
typo [software/cumin] - https://gerrit.wikimedia.org/r/508081
[14:25:39] (CR) jerkins-bot: [V: -1] documentation: fix typo [software/cumin] - https://gerrit.wikimedia.org/r/508081 (owner: Volans)
[14:25:41] (CR) jerkins-bot: [V: -1] flake8: enforce import order and adopt W504 [software/cumin] - https://gerrit.wikimedia.org/r/508080 (owner: Volans)
[14:25:45] (CR) jerkins-bot: [V: -1] tox: refactor configuration [software/cumin] - https://gerrit.wikimedia.org/r/508079 (owner: Volans)
[14:33:03] (CR) Volans: "Blocked by https://phabricator.wikimedia.org/T222512" [software/cumin] - https://gerrit.wikimedia.org/r/508079 (owner: Volans)
[14:34:15] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[14:47:37] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:11:25] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[15:12:01] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[15:14:07] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[15:17:51] PROBLEM - puppet last run on mc1029 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[15:24:35] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[15:25:11] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[15:44:21] RECOVERY - puppet last run on mc1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:05:19] (PS1) Ammarpad: Enable Page Previews as default on hewikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017)
[16:58:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:04:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[22:17:30] (PS1) Reedy: Add flow, lsearch and sitelist to xml/index.html [mediawiki-config] - https://gerrit.wikimedia.org/r/508100
[22:18:04] (CR) Reedy: [C: +2] Add flow, lsearch and sitelist to xml/index.html [mediawiki-config] - https://gerrit.wikimedia.org/r/508100 (owner: Reedy)
[22:19:06] (Merged) jenkins-bot: Add flow, lsearch and sitelist to xml/index.html [mediawiki-config] - https://gerrit.wikimedia.org/r/508100 (owner: Reedy)
[22:19:20] (CR) jenkins-bot: Add flow, lsearch and sitelist to xml/index.html [mediawiki-config] -
https://gerrit.wikimedia.org/r/508100 (owner: Reedy)
[22:20:42] !log reedy@deploy1001 Synchronized docroot/mediawiki/xml/index.html: Add extra xml namespace links (duration: 01m 06s)
[22:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:33] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:26:37] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:26:37] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:26:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:26:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:26:59] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:27:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:27:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:27:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:27:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[23:27:51] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[23:27:57] that doesn't sound good
[23:28:16] what's up
[23:28:23] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[23:28:25] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[23:28:35] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:29:03] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[23:29:49] chaomodus, possibly the same issue again?
[23:29:55] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[23:30:00] user report at https://phabricator.wikimedia.org/T222418#5157744
[23:30:03] yeah, looks the same
[23:30:09] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:30:12] same symptoms, anyway
[23:30:24] that mailbox lag thing
[23:32:33] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:33:02] https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-3h&to=now
[23:33:09] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:33:11] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:33:12] Operations, Wikimedia-General-or-Unknown, User-DannyS712: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (Samwilson)
[23:33:13] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:33:29] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:33:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:33:35] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:33:39] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (Krenair)
[23:33:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:33:47] O.o Krenair https://en.wikipedia.org/w/index.php?title=Draft:Lassana_N%27Diaye&diff=895537723&oldid=895536904&diffmode=source
[23:33:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:33:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:33:56] transient
[23:34:04] substitution is broken
[23:34:05] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:34:13] backends look like they recovered
[23:34:18] err, the graphs I mean.
[23:34:21] ToBeFree, isn't there a way for users to trigger that behaviour?
[23:34:32] possibly, but the user is asking for help and has no idea
[23:34:45] this may be a distraction from the issue at hand
[23:34:51] I suggest putting a task in
[23:35:09] I thought the timing might explain a connection. hm
[23:35:20] I got 500 errors on userpages at the same time
[23:35:49] ah well, it's likely fixed now; I just thought there was an explanation for this as a known issue or something like that
[23:35:54] * ToBeFree shrugs
[23:36:23] ToBeFree: It is caused by an unclosed HTML comment.
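The diagnosis above (an unclosed HTML comment in the wikitext breaking substitution, since the comment swallows everything after it) can be sanity-checked mechanically. A minimal sketch of such a check, as a rough balance count rather than MediaWiki's actual parser, and with the function name being hypothetical:

```python
def unclosed_html_comments(wikitext: str) -> int:
    """Count '<!--' openers that never get a matching '-->' closer.

    An unclosed comment hides the rest of the page from the parser,
    which is how a single bad edit can break template substitution.
    This is a crude count, not real wikitext parsing.
    """
    opens = wikitext.count("<!--")
    closes = wikitext.count("-->")
    return max(opens - closes, 0)

assert unclosed_html_comments("text <!-- a note --> more text") == 0
assert unclosed_html_comments("text <!-- oops, never closed") == 1
```

A check along these lines could flag the problem before an editor has to bisect a diff like the one linked above.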
[23:36:46] https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-48h&to=now
[23:36:57] does not look as bad as when this happened two days ago
[23:37:33] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[23:37:45] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[23:37:55] gaah
[23:38:09] JJMC89: thank you so much, that would have taken a while for me to debug
[23:38:22] I was about to re-do the edit
[23:38:55] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[23:39:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[23:39:31] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[23:44:33] chaomodus, I notice the Varnish backend number of objects graph in particular - servers just suddenly dropping to 0?
[23:44:36] are they restarting?
[23:45:05] idk, I didn't restart them, but it's possible
[23:47:33] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[23:58:56] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712, Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (Krenair) possible continuation of https://wikitech.wikimedia.org/wiki/Incident_documentation/20190503-varnish