[00:36:51] <icinga-wm>	 RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[00:47:15] <wikibugs>	 (03PS1) 10Andrew Bogott: grafana: remove the distinction between readonly and admin domains [puppet] - 10https://gerrit.wikimedia.org/r/575746 (https://phabricator.wikimedia.org/T246508)
[00:48:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] grafana: remove the distinction between readonly and admin domains [puppet] - 10https://gerrit.wikimedia.org/r/575746 (https://phabricator.wikimedia.org/T246508) (owner: 10Andrew Bogott)
[01:31:44] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolforge: remove the ancient version of kubectl [puppet] - 10https://gerrit.wikimedia.org/r/575325 (https://phabricator.wikimedia.org/T214513) (owner: 10Bstorm)
[01:57:16] <wikibugs>	 (03PS2) 10Andrew Bogott: grafana: remove the distinction between readonly and admin domains [puppet] - 10https://gerrit.wikimedia.org/r/575746 (https://phabricator.wikimedia.org/T246508)
[02:02:27] <wikibugs>	 (03PS3) 10Andrew Bogott: grafana: remove the distinction between readonly and admin domains [puppet] - 10https://gerrit.wikimedia.org/r/575746 (https://phabricator.wikimedia.org/T246508)
[02:52:16] <wikibugs>	 (03PS4) 10Andrew Bogott: grafana: remove the distinction between readonly and admin domains [puppet] - 10https://gerrit.wikimedia.org/r/575746 (https://phabricator.wikimedia.org/T246508)
[02:57:28] <wikibugs>	 (03PS5) 10Andrew Bogott: grafana: remove the distinction between readonly and admin domains [puppet] - 10https://gerrit.wikimedia.org/r/575746 (https://phabricator.wikimedia.org/T246508)
[03:01:49] <wikibugs>	 (03PS6) 10Andrew Bogott: grafana: remove the distinction between readonly and admin domains [puppet] - 10https://gerrit.wikimedia.org/r/575746 (https://phabricator.wikimedia.org/T246508)
[03:04:31] <wikibugs>	 (03PS7) 10Andrew Bogott: grafana: remove the distinction between readonly and admin domains [puppet] - 10https://gerrit.wikimedia.org/r/575746 (https://phabricator.wikimedia.org/T246508)
[03:07:48] <wikibugs>	 (03PS8) 10Andrew Bogott: grafana: remove the distinction between readonly and admin domains [puppet] - 10https://gerrit.wikimedia.org/r/575746 (https://phabricator.wikimedia.org/T246508)
[03:09:12] <wikibugs>	 (03PS9) 10Andrew Bogott: grafana: remove the distinction between readonly and admin domains [puppet] - 10https://gerrit.wikimedia.org/r/575746 (https://phabricator.wikimedia.org/T246508)
[03:11:08] <wikibugs>	 (03PS10) 10Andrew Bogott: grafana: remove the distinction between readonly and admin domains [puppet] - 10https://gerrit.wikimedia.org/r/575746 (https://phabricator.wikimedia.org/T246508)
[03:30:55] <wikibugs>	 (03PS1) 10Bstorm: toolforge: fix front page check [puppet] - 10https://gerrit.wikimedia.org/r/575754
[03:57:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] toolforge: fix front page check [puppet] - 10https://gerrit.wikimedia.org/r/575754 (owner: 10Bstorm)
[04:16:48] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] grafana: remove the distinction between readonly and admin domains [puppet] - 10https://gerrit.wikimedia.org/r/575746 (https://phabricator.wikimedia.org/T246508) (owner: 10Andrew Bogott)
[04:38:29] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudmetrics: duplicate profile::grafana::ldap to host-specific hiera [puppet] - 10https://gerrit.wikimedia.org/r/575755
[04:40:47] <wikibugs>	 (03PS2) 10Andrew Bogott: cloudmetrics: duplicate profile::grafana::ldap to host-specific hiera [puppet] - 10https://gerrit.wikimedia.org/r/575755
[04:42:59] <wikibugs>	 (03PS3) 10Andrew Bogott: cloudmetrics: duplicate profile::grafana::ldap to host-specific hiera [puppet] - 10https://gerrit.wikimedia.org/r/575755
[04:44:42] <wikibugs>	 (03PS4) 10Andrew Bogott: cloudmetrics: duplicate profile::grafana::ldap to host-specific hiera [puppet] - 10https://gerrit.wikimedia.org/r/575755
[04:50:35] <wikibugs>	 (03Abandoned) 10Andrew Bogott: cloudmetrics: duplicate profile::grafana::ldap to host-specific hiera [puppet] - 10https://gerrit.wikimedia.org/r/575755 (owner: 10Andrew Bogott)
[04:59:48] <wikibugs>	 (03Abandoned) 10Andrew Bogott: grafana-labs: remove auto-login feature [puppet] - 10https://gerrit.wikimedia.org/r/575642 (https://phabricator.wikimedia.org/T246502) (owner: 10Andrew Bogott)
[05:00:34] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "grafana: prevent Anonymous viewers from editing user settings" [puppet] - 10https://gerrit.wikimedia.org/r/575756
[05:00:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "grafana: prevent Anonymous viewers from editing user settings" [puppet] - 10https://gerrit.wikimedia.org/r/575756 (owner: 10Andrew Bogott)
[05:03:27] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "grafana: prevent Anonymous viewers from editing user settings" [puppet] - 10https://gerrit.wikimedia.org/r/575757
[05:03:29] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudmetrics: clean up the grafana-labs-admin endpoint [puppet] - 10https://gerrit.wikimedia.org/r/575758 (https://phabricator.wikimedia.org/T246508)
[05:03:31] <wikibugs>	 (03PS1) 10Andrew Bogott: grafana: remove the obsolete X-WEBAUTH-USER hack for grafana logins [puppet] - 10https://gerrit.wikimedia.org/r/575759 (https://phabricator.wikimedia.org/T246508)
[05:09:51] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "grafana: prevent Anonymous viewers from editing user settings" [puppet] - 10https://gerrit.wikimedia.org/r/575757 (owner: 10Andrew Bogott)
[05:16:42] <wikibugs>	 (03PS2) 10Andrew Bogott: grafana: remove the obsolete X-WEBAUTH-USER hack for grafana logins [puppet] - 10https://gerrit.wikimedia.org/r/575759 (https://phabricator.wikimedia.org/T246508)
[05:50:44] <wikibugs>	 10Operations, 10Graphite: Make it easier to ban misbehaving dashboards from graphite - https://phabricator.wikimedia.org/T119718 (10Aklapper) @fgiunchedi: Hi, the patch in Gerrit has been abandoned. Do you still plan to work on this? Asking as you are set as task assignee.  Thanks in advance!
[06:02:48] <logmsgbot>	 !log ariel@deploy1001 Started deploy [dumps/dumps@8376c62]: refactor page content jobs, prefetch, and output file listings: see T246465
[06:02:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:52] <logmsgbot>	 !log ariel@deploy1001 Finished deploy [dumps/dumps@8376c62]: refactor page content jobs, prefetch, and output file listings: see T246465 (duration: 00m 04s)
[06:02:53] <stashbot>	 T246465: clean up page content generation code and file listing methods as prep work for splitting page content generation across multiple servers - https://phabricator.wikimedia.org/T246465
[06:02:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:23] <apergos>	 $someday it would be nice to be able to have deployment windows that aren't just a day or two before the next run.
[06:04:44] <apergos>	 at least this code is a step towards that :-/
[06:05:02] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] "The admin-legacy Ingress redirects /admin to /admin/. Interesting that we have seen this as flapping rather than always failing. I'm not s" [puppet] - 10https://gerrit.wikimedia.org/r/575754 (owner: 10Bstorm)
[06:11:40] <wikibugs>	 10Operations, 10Traffic: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236 (10Aklapper) @BBlack: Hi! This task has been assigned to you a while ago. Do you still plan to work on this task? If this task has been resolved in the meantime: Please update the task status (via {...
[06:12:32] <wikibugs>	 10Operations, 10RESTBase-Cassandra, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, 10Services (watching): setup an alertable threshold for Cassandra heap dumps - https://phabricator.wikimedia.org/T106346 (10Aklapper) @fgiunchedi: Hi! This task has been assigned to you a while ago....
[06:24:56] <wikibugs>	 10Operations, 10Traffic, 10conftool, 10discovery-system, 10services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10Aklapper)
[06:42:08] <wikibugs>	 10Operations: Install nodejs, nginx and other dependencies on francium - https://phabricator.wikimedia.org/T94457 (10Aklapper) @ArielGlenn: Hi, all related patches in Gerrit have been merged or abandoned. Do you still plan to work on this task / is there still more to do here? Asking as you are set as task assig...
[06:46:09] <wikibugs>	 10Operations, 10Cassandra, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review, 10User-Eevans: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (10Aklapper) @Eevans: Hi! This task has been assigned to you a while ago. Do you still plan to work on...
[06:51:58] <wikibugs>	 (03PS1) 10Gergő Tisza: Set GrowthExperiments help panel search API for beta viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575762 (https://phabricator.wikimedia.org/T246511)
[06:53:53] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552 (10Aklapper) @fgiunchedi: Hi, all related patches in Gerrit have been merged or abandoned. Is there more to do in this task? Asking as you are set as task a...
[06:54:17] <wikibugs>	 10Operations, 10observability, 10User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552 (10Aklapper)
[06:55:00] <wikibugs>	 10Operations, 10Graphite: graphite clustering plan - https://phabricator.wikimedia.org/T86316 (10Aklapper) @fgiunchedi: Hi, the patch in Gerrit has been merged. Can this task be resolved (via {nav name=Add Action... > Change Status} in the dropdown menu), or is there more to do in this task? If there is more t...
[07:56:00] <wikibugs>	 (03PS1) 10Elukey: Revert BigTop Settings for Analytics Hadoop test UI [puppet] - 10https://gerrit.wikimedia.org/r/575765
[07:58:36] <icinga-wm>	 PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:59:36] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:00:00] <icinga-wm>	 PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:02:48] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Revert BigTop Settings for Analytics Hadoop test UI [puppet] - 10https://gerrit.wikimedia.org/r/575765 (owner: 10Elukey)
[08:05:32] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/6 UP : OSPFv3: 4/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:05:53] <cdanis>	 uh
[08:07:40] <elukey>	 GTT right?
[08:07:44] <elukey>	 is there maintenance?
[08:08:09] <cdanis>	 not on the calendar, but I already saw one missed maintenance this weekend
[08:08:23] <cdanis>	 it's 3 different routers though
[08:09:23] <elukey>	 seems knams/eqdfw no?
[08:09:39] <cdanis>	 and cr1-eqiad
[08:10:06] <elukey>	 ah didn't see it
[08:10:21] <cdanis>	 i wish the OSPF alerts included which neighbor
[08:10:29] <elukey>	 if both links are GTT to esams maybe there is a problem there
[08:12:13] <cdanis>	 on cr1-eqiad it's xe-4/2/2
[08:13:17] <elukey>	 that is GTT to cr2-knams?
[08:13:46] <cdanis>	 it's GTT, not sure about endpoint yet
[08:13:53] <cdanis>	 netbox doesn't have Z-side
[08:14:16] <elukey>	 checking on cr1
[08:14:32] <cdanis>	 ah, and it looks like it's a VPLS and we use a VLAN?
[08:14:57] <cdanis>	 I see same circuit ID for the eqdfw and knams ones
[08:15:06] <cdanis>	 ok I will email GTT
[08:15:09] <elukey>	 thanks!
[08:15:54] <elukey>	 can't find any email from them 
[08:16:18] <elukey>	 but probably they'll send something "in a bit"
[08:16:22] <elukey>	 when they realize the issue :D
[08:18:08] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:18:32] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:18:50] <elukey>	 ahahaha
[08:19:08] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:19:34] <icinga-wm>	 RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:20:43] <cdanis>	 The following message to <noc@gtt.net> was undeliverable.
[08:20:45] <cdanis>	 sigh
[08:24:03] <wikibugs>	 10Operations, 10netops: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (10CDanis)
[10:34:58] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:37:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:51:12] <icinga-wm>	 PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 122 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[11:18:20] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:18:46] <icinga-wm>	 PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:19:32] <icinga-wm>	 PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:20:08] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:20:34] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:20:58] <icinga-wm>	 RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:21:08] <cdanis>	 must be the same GTT circuit
[11:21:13] <cdanis>	 fwiw it's just backup
[11:21:44] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:22:20] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:14:10] <icinga-wm>	 RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 70.67 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[12:26:46] <icinga-wm>	 RECOVERY - PHP opcache health on mw1251 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[13:54:04] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:00:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:07:12] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:09:22] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:18:02] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:20:12] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:24:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:28:58] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:35:28] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:42:00] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:44:10] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:48:34] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:52:56] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:16:56] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:17:44] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:21:18] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:22:06] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:27:52] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:28:38] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:30:48] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:32:12] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:40:56] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:45:18] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:58:22] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:01:20] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:01:42] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[16:03:32] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:03:46] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[16:04:56] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:05:51] <wikibugs>	 10Operations, 10Analytics: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey)
[16:05:53] <wikibugs>	 (03PS1) 10Reedy: Add viwiki to interwiki-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575786 (https://phabricator.wikimedia.org/T246511)
[16:06:11] <wikibugs>	 10Operations, 10Analytics, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey)
[16:06:21] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Add viwiki to interwiki-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575786 (https://phabricator.wikimedia.org/T246511) (owner: 10Reedy)
[16:07:26] <wikibugs>	 (03Merged) 10jenkins-bot: Add viwiki to interwiki-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575786 (https://phabricator.wikimedia.org/T246511) (owner: 10Reedy)
[16:08:30] <logmsgbot>	 !log reedy@deploy1001 scap failed: average error rate on 5/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
[16:08:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:42] <Reedy>	 o_0
[16:10:04] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:15:50] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:16:38] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:20:12] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:21:00] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:22:22] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:23:08] <wikibugs>	 10Operations, 10Analytics, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10elukey) Leaving a note since a lot of background work is happening, we didn't forget about this :)  The Analytics team is cu...
[16:23:10] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:31:06] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:37:40] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:40:38] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[16:44:12] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:47:10] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:47:54] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[16:48:34] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:49:58] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[16:50:18] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:51:32] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204,205} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-met
[16:52:18] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[16:52:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:54:30] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[16:55:54] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:59:28] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:02:26] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[17:03:00] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:03:22] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:05:04] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:05:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:08:58] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[17:10:22] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:13:22] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[17:14:44] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:15:22] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed
[17:15:22] <icinga-wm>	 ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:16:10] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:17:28] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:18:14] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:18:50] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:20:54] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:26:28] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[17:29:20] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:31:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:39:16] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:41:24] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[17:43:56] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[17:45:20] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:45:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce main traffic weight for db1087 as dumps are running ', diff saved to https://phabricator.wikimedia.org/P10563 and previous config saved to /var/cache/conftool/dbconfig/20200301-174536-marostegui.json
[17:45:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:21] <marostegui>	 !log Start replication on db1111 new host on s8 - T246447
[17:54:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:26] <stashbot>	 T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447
[18:03:21] <wikibugs>	 10Operations, 10Wikimedia-Site-requests: Disable and re-enable 2FA for user Iwan Novirion on Wikipedia - https://phabricator.wikimedia.org/T246585 (10Framawiki)
[18:15:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:17:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:21:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:25:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:35:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:37:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:49:21] <wikibugs>	 10Operations, 10Growth-Team, 10Mail, 10MediaWiki-Watchlist: Notifications about changes by Oznamovatel sent to Janbery doesn't seem to be reliable - https://phabricator.wikimedia.org/T245762 (10Aklapper) No reply for a week, boldly adding #operations because #mail
[18:50:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:54:40] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[18:54:58] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:59:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:59:28] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[19:01:32] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[19:03:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:05:32] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:09:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:12:11] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/575754 (owner: 10Bstorm)
[19:12:15] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolforge: fix front page check [puppet] - 10https://gerrit.wikimedia.org/r/575754 (owner: 10Bstorm)
[19:14:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:14:14] <wikibugs>	 (03CR) 10Bstorm: "In fact, it may be a matter of when it reads the results.  The 302 is failure (for whatever reason), but if it is slower to read, it might" [puppet] - 10https://gerrit.wikimedia.org/r/575754 (owner: 10Bstorm)
[19:16:36] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[19:18:50] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:19:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:21:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:21:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:25:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:29:50] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[19:30:24] <icinga-wm>	 PROBLEM - ElasticSearch unassigned shard check - 9200 on cloudelastic1002 is CRITICAL: CRITICAL - ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[3](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[1](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[6
[19:30:24] <icinga-wm>	 4:54.480Z), ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[2](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[5](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[5](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[0](2020-02-27T17:24:54.480Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:32:00] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:36:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:38:46] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:45:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:47:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:54:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_cxserver_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:54:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:56:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:56:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:18:18] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[20:20:30] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:29:18] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:37:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:38:30] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[20:40:34] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[20:41:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:46] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[20:47:18] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[20:51:22] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f62c3461400: Failed to establish a new connection: [Errno 111] Connection refused,): /?spec=True https://wikitech.wikimedia.org/wiki/Citoid
[20:53:48] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[21:00:02] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:15:48] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[21:17:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:18:04] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[21:19:48] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:57:10] <icinga-wm>	 PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:57:22] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:57:28] <icinga-wm>	 PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[21:57:54] <icinga-wm>	 PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[21:58:00] <icinga-wm>	 PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[21:58:20] <icinga-wm>	 PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[21:58:36] <icinga-wm>	 PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:58:38] <icinga-wm>	 PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[21:58:52] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:00:19] <elukey>	 Amir1: o/ - can you check on stat1007 your scripts? As far as I can see the are using a lot of memory :(
[22:00:46] <Amir1>	 elukey: hey, yeah, I hope it finishes soon
[22:01:00] <Amir1>	 is it blocking other processes?
[22:01:04] <icinga-wm>	 PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:01:23] <elukey>	 nono but it contributes in saturating stat1007
[22:01:27] <Amir1>	 oh, is this because of my script? 
[22:01:29] <elukey>	 causing --^
[22:01:57] <Amir1>	 wow, maybe I can move it to another host
[22:02:14] <Amir1>	 it's a one-off thing 
[22:02:28] <elukey>	 ah okok then let's finish it
[22:02:54] <Amir1>	 I think it should finish by tomorrow (fingers crossed)
[22:05:26] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:33] <elukey>	 ack thanks!
[22:06:08] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[22:06:16] <icinga-wm>	 RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[22:06:40] <icinga-wm>	 RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[22:06:46] <icinga-wm>	 RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[22:07:06] <icinga-wm>	 RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[22:07:22] <icinga-wm>	 RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[22:07:24] <icinga-wm>	 RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 1 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[22:09:34] <icinga-wm>	 RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:12:14] <icinga-wm>	 RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:25:36] <icinga-wm>	 PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:27:44] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[22:28:00] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[22:28:08] <icinga-wm>	 PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[22:28:32] <icinga-wm>	 PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[22:28:36] <icinga-wm>	 PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[22:28:58] <icinga-wm>	 PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[22:29:12] <icinga-wm>	 PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[22:29:12] <RhinosF1>	 Amir1, elukey: ^
[22:29:14] <icinga-wm>	 PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[22:29:30] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:29:31] <Amir1>	 Okay I think I need to tame my script
[22:29:48] <Amir1>	 I don't know if there's a proper way to do it
[22:32:28] <icinga-wm>	 PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:32:37] <Amir1>	 it has 20GB free memory though
[22:35:41] <Amir1>	 it will be over in an hour or so
[22:35:49] <Amir1>	 most of it at least
[22:46:54] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:47:36] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[22:47:44] <icinga-wm>	 RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[22:47:56] <icinga-wm>	 RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:48:08] <icinga-wm>	 RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[22:48:12] <icinga-wm>	 RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[22:48:34] <icinga-wm>	 RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[22:48:50] <icinga-wm>	 RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[22:48:52] <icinga-wm>	 RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[22:51:00] <icinga-wm>	 RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:51:04] <icinga-wm>	 PROBLEM - ElasticSearch unassigned shard check - 9200 on cloudelastic1001 is CRITICAL: CRITICAL - ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[2](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[3](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[1](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7
[22:51:04] <icinga-wm>	 4:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[5](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[5](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[0](2020-02-27T17:24:54.480Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:58:54] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on stat1007 is OK: OK: synced at Sun 2020-03-01 22:58:53 UTC. https://wikitech.wikimedia.org/wiki/NTP
[23:26:34] <icinga-wm>	 PROBLEM - ElasticSearch unassigned shard check - 9200 on cloudelastic1003 is CRITICAL: CRITICAL - ebernhardson_deleteme[1](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[3](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[5
[23:26:34] <icinga-wm>	 4:54.480Z), ebernhardson_deleteme[5](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[2](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[0](2020-02-27T17:24:54.480Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:46:38] <icinga-wm>	 PROBLEM - ElasticSearch unassigned shard check - 9200 on cloudelastic1004 is CRITICAL: CRITICAL - ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[2](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[5](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[5
[23:46:38] <icinga-wm>	 4:54.480Z), ebernhardson_deleteme[3](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[1](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[0](2020-02-27T17:24:54.480Z) https://wikitech.wikimedia.org/wiki/Search%23Administration