[00:12:44] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 2
[01:00:54] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 1 down 0
[02:13:23] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 786563.2765957445 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:13:34] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 587195.8912466845 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:26:24] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:26:43] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:31:43] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 483948.49753086426 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[02:33:33] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 56351.585714285706 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[02:34:33] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:34:44] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:39:53] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 436894.3608748481 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:40:34] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 354850.66071428574 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[02:42:53] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:43:34] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:48:53] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 246977.91371681416 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[02:49:43] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 280467.4350282486 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:50:54] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:51:43] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:56:54] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 202016.55496264674 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[02:57:43] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 430390.36091127095 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[03:00:03] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:00:44] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:05:03] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 207008.50635593222 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[03:06:53] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 287651.9811764706 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:07:03] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:09:53] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:14:03] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 152391.0781089414 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:15:53] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 193913.6470588235 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:16:04] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:17:54] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:21:13] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 330174.0902004455 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:23:54] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 311970.96321839077 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[03:24:13] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:26:23] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 769.46 seconds
[03:27:03] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:30:13] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 87875.80880330125 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:32:13] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:33:03] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 301060.68720930227 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[03:35:53] <icinga-wm>	 PROBLEM - puppet last run on mw1346 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:36:03] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:36:14] <icinga-wm>	 PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:37:24] <icinga-wm>	 PROBLEM - puppet last run on mw1309 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-ISP.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-ISP.mmdb.test]
[03:38:14] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 230793.1486033519 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:40:14] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:41:03] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 533680.0151006712 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:45:03] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:46:23] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 294832.28459821426 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[03:48:23] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:53:33] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 104.55 seconds
[04:00:53] <icinga-wm>	 RECOVERY - puppet last run on mw1346 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[04:01:14] <icinga-wm>	 RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[04:02:24] <icinga-wm>	 RECOVERY - puppet last run on mw1309 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:48:36] <wikibugs>	 (PS1) Gergő Tisza: Set $wgPropagateErrors to false in production [mediawiki-config] - https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086)
[09:51:52] <wikibugs>	 (PS2) Gergő Tisza: Set $wgPropagateErrors to false in production [mediawiki-config] - https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086)
[09:52:39] <wikibugs>	 (CR) Gergő Tisza: "Not sure how to test this. Maybe it would be better to only enable it on half of the appservers for a week, and then compare?" [mediawiki-config] - https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner: Gergő Tisza)
[09:55:59] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-redis0[12] - https://phabricator.wikimedia.org/T191163#4095767 (MarcoAurelio)
[10:03:23] <wikibugs>	 Operations, Puppet, Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#4095780 (MarcoAurelio) >>! In T187736#3984348, @Paladox wrote: > I think this was just deleted but was never removed from shinken....
[10:24:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) timed
[10:24:03] <icinga-wm>	 nse was received
[10:25:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:25:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a
[10:25:04] <icinga-wm>	 ved
[10:25:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:25:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) timed
[10:25:04] <icinga-wm>	 nse was received
[10:25:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for Apri
[10:25:05] <icinga-wm>	 ut before a response was received
[10:25:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:25:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:25:43] <icinga-wm>	 PROBLEM - SSH on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:25:43] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Manitowoc, Wisconsin) timed out before a response was received: /{domain}/v1/page/references/{title}{/revision}{/tid} (retrieve structured reference data for the Cat article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/css/mobile/app/site (Untitled tes
[10:25:43] <icinga-wm>	  a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) timed out before a response was received: /{
[10:25:43] <icinga-wm>	 ia/{title}{/revision}{/tid} (retrieve media items of en.wp Cat page via media route) timed out before a response was received
[10:25:43] <icinga-wm>	 PROBLEM - DPKG on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:25:53] <icinga-wm>	 PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused
[10:26:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy
[10:26:04] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:26:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out befor
[10:26:04] <icinga-wm>	 ceived
[10:26:14] <icinga-wm>	 PROBLEM - apertium apy on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:26:14] <icinga-wm>	 PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:26:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[10:26:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneTyp
[10:26:24] <icinga-wm>	 ribute get)
[10:26:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) is CRITICAL: Test Get references of a test page returned the unexpected status 504 (expecting: 200)
[10:26:34] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:26:43] <icinga-wm>	 RECOVERY - DPKG on scb1001 is OK: All packages OK
[10:27:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[10:27:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy
[10:27:04] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:27:13] <icinga-wm>	 RECOVERY - apertium apy on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 4.640 second response time
[10:28:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[10:28:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[10:28:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[10:28:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[10:28:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[10:28:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[10:28:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[10:28:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[10:28:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[10:29:29] <wikibugs>	 Operations, DNS, Release-Engineering-Team, Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4095791 (Aklapper) @KATMAKROFAN: The Wikimedia movement (metawiki) and the Wikimedia Foundation (foundationw...
[10:29:34] <icinga-wm>	 RECOVERY - SSH on scb1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[10:31:53] <icinga-wm>	 RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.028 second response time
[10:32:44] <icinga-wm>	 PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received
[10:34:13] <icinga-wm>	 RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[10:34:53] <icinga-wm>	 RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[11:09:29] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka0[45] - https://phabricator.wikimedia.org/T191154#4095839 (EddieGP) You can't, this is limited to project admins, where you're just an project member.  I expect @Ottomata can give us an hint wh...
[11:49:14] <aharoni>	 um... it's not urgent (at least for me), but I cannot ssh to terbium
[11:49:38] <aharoni>	 ssh: connect to host bast1001.wikimedia.org port 22: Operation timed out ssh_exchange_identification: Connection closed by remote host
[12:11:32] <volans>	 aharoni: bast1001 has been decommissioned, use bast1002 instead or any of the others (see email, but cannot recall which ML right now ;) )
[12:11:54] <aharoni>	 volans: thanks, trying...
[12:14:37] <aharoni>	 volans: it worked, thanks
[12:14:53] <volans>	 yw
[12:23:32] <wikibugs>	 (PS2) EddieGP: hiera: Kill some hiera paths [labs/private] - https://gerrit.wikimedia.org/r/423189
[12:23:35] <wikibugs>	 (PS2) EddieGP: Add note: most stuff here not used in cloud vps hiera [labs/private] - https://gerrit.wikimedia.org/r/423233
[13:14:27] <wikibugs>	 (PS1) Urbanecm: Throttle rule for 2018-04-04, clean obsolete rules [mediawiki-config] - https://gerrit.wikimedia.org/r/423340 (https://phabricator.wikimedia.org/T191168)
[13:14:53] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[13:28:53] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[14:42:10] <wikibugs>	 (Abandoned) Gergő Tisza: Adding config for WikimediaEvents module for logging behaviour data [mediawiki-config] - https://gerrit.wikimedia.org/r/408386 (https://phabricator.wikimedia.org/T186244) (owner: Groovier1)
[14:42:13] <wikibugs>	 (Abandoned) Gergő Tisza: Adding config for WikimediaEvents module for logging behaviour data [mediawiki-config] - https://gerrit.wikimedia.org/r/408381 (https://phabricator.wikimedia.org/T186244) (owner: Groovier1)
[15:59:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) timed out before a response was received
[15:59:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[15:59:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[16:00:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[16:00:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[16:00:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[16:01:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[16:01:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy
[16:01:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[16:02:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[16:02:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[16:02:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[17:18:34] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[18:18:53] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[19:55:24] <wikibugs>	 Operations, Traffic, Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#4096076 (Umar) What should I do to fix the problem? Excuse for troubling!
[20:11:19] <wikibugs>	 Operations, Dumps-Generation: Write issues on dumpsdata1001 - https://phabricator.wikimedia.org/T191177#4096079 (ArielGlenn) p:Triage>High
[20:26:22] <wikibugs>	 Operations, Dumps-Generation: Write issues on dumpsdata1001 - https://phabricator.wikimedia.org/T191177#4096110 (ArielGlenn) The original error I saw on snapshot1001 was: ``` 2018-04-01 19:56:56: cawiki Checksumming cawiki-20180301-langlinks.sql.gz via sha1 Traceback (most recent call last):   File "./wo...
[20:29:55] <wikibugs>	 Operations, Dumps-Generation: Write issues on dumpsdata1001 - https://phabricator.wikimedia.org/T191177#4096123 (ArielGlenn) I'm pretty out of commission right now so I have no idea if these are related or even if that first error is somehow normal. Will look tomorrow.
[20:37:14] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[21:17:33] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1