[00:03:59] <icinga-wm>	 PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:25] <icinga-wm>	 PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:27] <icinga-wm>	 PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:07] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:21] <icinga-wm>	 PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:57:28] <icinga-wm>	 PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton
[01:00:07] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:00:27] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:53] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:01:13] <icinga-wm>	 RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[01:02:45] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:03:11] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:06:45] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[01:06:53] <icinga-wm>	 PROBLEM - MD RAID on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:19:27] <icinga-wm>	 PROBLEM - DPKG on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[01:20:27] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:20:29] <icinga-wm>	 PROBLEM - configured eth on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[01:21:35] <icinga-wm>	 PROBLEM - dhclient process on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[01:22:51] <icinga-wm>	 PROBLEM - Disk space on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops
[01:31:47] <icinga-wm>	 RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:32:07] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:32:27] <icinga-wm>	 RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:37:37] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1004 is OK: OK: synced at Sun 2020-07-12 01:37:37 UTC. https://wikitech.wikimedia.org/wiki/NTP
[01:39:27] <icinga-wm>	 RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:43:43] <icinga-wm>	 RECOVERY - Disk space on kubernetes1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops
[01:50:21] <icinga-wm>	 RECOVERY - DPKG on kubernetes1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[01:51:21] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:51:23] <icinga-wm>	 RECOVERY - configured eth on kubernetes1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[01:52:29] <icinga-wm>	 RECOVERY - dhclient process on kubernetes1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[02:00:05] <icinga-wm>	 PROBLEM - LibreNMS has a critical alert #page on icinga1001 is CRITICAL: Primary outbound port utilisation over 80% #page (asw2-esams.mgmt.esams.wmnet) https://docs.google.com/document/d/1SeXdegjsfL94R6XYB1I4Uv8yjCPH1tVXeL0taJF0NNs/preview%23heading=h.qkfum7lgbdo5
[02:01:40] <XioNoX>	 rzl: looks like monitoring glitch
[02:01:57] <rzl>	 dang, sure does
[02:01:58] <icinga-wm>	 RECOVERY - LibreNMS has a critical alert #page on icinga1001 is OK: OK: zero critical LibreNMS alerts https://docs.google.com/document/d/1SeXdegjsfL94R6XYB1I4Uv8yjCPH1tVXeL0taJF0NNs/preview%23heading=h.qkfum7lgbdo5
[02:02:23] <cdanis>	 sigh
[02:02:29] <XioNoX>	 https://phabricator.wikimedia.org/T252630
[02:02:33] <cdanis>	 yeah agreed, I think we saw this once before
[02:02:35] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:02:43] <cdanis>	 ah as usual, beaten to the punch by arzhel
[02:03:41] <XioNoX>	 we could add a condition to the librenms check so it needs to be < 101% or something like that as a workaround
[02:04:17] <wikibugs>	 Operations, netops, observability: LibreNMS monitoring glitch caused paging - https://phabricator.wikimedia.org/T252630 (CDanis) This happened again just now.  Something else we could do as a temporary mitigation is just accept a longer time-to-page in legitimate incidents and increase retries to req...
[02:04:29] <cdanis>	 oh, that would be good too
[02:05:47] <XioNoX>	 yeah longer time to page too, saw your comment
[02:05:59] <XioNoX>	 back to sleep!
[02:06:15] <rzl>	 o7 have a nice rest of your respective evenings
[02:07:36] <cdanis>	 o/
[02:08:09] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:11:02] <cdanis>	 gaaaaaaaaaaaaaaah I just noticed the notes_url
[03:57:11] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:55:13] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:58:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:25:25] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:15] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:28:05] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:29:07] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:35:07] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:36:09] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:01] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:51] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:42:39] <wikibugs>	 (CR) Legoktm: [C: +1] VisualEditor: Explicitly set visualeditor-enable to 0 when non-default [mediawiki-config] - https://gerrit.wikimedia.org/r/610156 (https://phabricator.wikimedia.org/T248343) (owner: C. Scott Ananian)
[06:44:35] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:46:27] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200712T0700)
[07:05:49] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:11:41] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:59:57] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:05:45] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 46 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:04:17] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:06:17] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:06:35] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:57] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:06:59] <icinga-wm>	 PROBLEM - configured eth on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[09:07:55] <icinga-wm>	 PROBLEM - MD RAID on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:08:03] <icinga-wm>	 PROBLEM - dhclient process on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[09:24:03] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[09:32:09] <icinga-wm>	 RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:32:35] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:35:33] <icinga-wm>	 RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:37:47] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:37:47] <icinga-wm>	 RECOVERY - configured eth on kubernetes1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[09:38:53] <icinga-wm>	 RECOVERY - dhclient process on kubernetes1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[09:40:27] <icinga-wm>	 RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:54:53] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1004 is OK: OK: synced at Sun 2020-07-12 09:54:52 UTC. https://wikitech.wikimedia.org/wiki/NTP
[10:12:19] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:12:27] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman - https://phabricator.wikimedia.org/T240929 (Peachey88) >>! In T240929#5763716, @jcrespo wrote: > I am personally not familiar with mailman format. Maybe @Herron, our mail expert, knows...
[10:21:31] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:21:35] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:25:15] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:27:11] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:32:45] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:36:25] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:38:33] <wikibugs>	 (CR) Jforrester: "I'll deploy this on Monday, if I remember." [mediawiki-config] - https://gerrit.wikimedia.org/r/594214 (owner: Jforrester)
[11:58:19] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:27] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[12:01:15] <icinga-wm>	 PROBLEM - configured eth on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[12:02:03] <icinga-wm>	 PROBLEM - puppet last run on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:02:19] <icinga-wm>	 PROBLEM - puppet last run on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:02:41] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[12:03:07] <icinga-wm>	 PROBLEM - MD RAID on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[12:03:27] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[12:03:31] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:03:49] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:07:53] <icinga-wm>	 PROBLEM - MD RAID on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[12:10:33] <icinga-wm>	 PROBLEM - dhclient process on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[12:10:47] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:13:03] <icinga-wm>	 PROBLEM - Disk space on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes2002&var-datasource=codfw+prometheus/ops
[12:13:23] <icinga-wm>	 PROBLEM - DPKG on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:15:05] <icinga-wm>	 RECOVERY - Check size of conntrack table on kubernetes2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[12:16:47] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:15] <icinga-wm>	 RECOVERY - Check size of conntrack table on kubernetes2004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[12:18:43] <icinga-wm>	 RECOVERY - MD RAID on kubernetes2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[12:19:35] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:19:51] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:20:09] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:24:47] <icinga-wm>	 RECOVERY - MD RAID on kubernetes2004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[12:32:05] <icinga-wm>	 RECOVERY - configured eth on kubernetes2004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[12:33:33] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes2002 is OK: OK: synced at Sun 2020-07-12 12:33:31 UTC. https://wikitech.wikimedia.org/wiki/NTP
[12:33:55] <icinga-wm>	 RECOVERY - Disk space on kubernetes2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes2002&var-datasource=codfw+prometheus/ops
[12:34:41] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:41:23] <icinga-wm>	 RECOVERY - dhclient process on kubernetes2002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[12:41:37] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:44:13] <icinga-wm>	 RECOVERY - DPKG on kubernetes2004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[13:10:00] <wikibugs>	 Operations, Analytics-Radar, Gerrit: upgrade git-review to >= 1.27 on all stretch hosts across the board - https://phabricator.wikimedia.org/T257609 (QChris)
[15:37:13] <icinga-wm>	 PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton
[15:38:59] <icinga-wm>	 RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[16:18:52] <wikibugs>	 (CR) Krinkle: [C: +1] zuul: stop prefixing report with the job name [puppet] - https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: Hashar)
[16:45:43] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:49:25] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:53:03] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:55:48] <wikibugs>	 (PS1) Urbanecm: Initial configuration for arywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/611934 (https://phabricator.wikimedia.org/T257674)
[16:58:21] <wikibugs>	 (PS3) Urbanecm: Initial configuration for lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: RhinosF1)
[16:58:37] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:58:42] <RhinosF1>	 Thanks Urbanecm
[16:58:45] <Urbanecm>	 np
[16:59:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] Initial configuration for lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: RhinosF1)
[16:59:42] <RhinosF1>	 Urbanecm: ^
[16:59:48] <Urbanecm>	 I see that ;)
[17:00:51] <wikibugs>	 (PS4) Urbanecm: Initial configuration for lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: RhinosF1)
[17:04:09] <wikibugs>	 (CR) Urbanecm: [C: -1] Initial configuration for lijwikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: RhinosF1)
[17:05:47] <RhinosF1>	 Urbanecm: I see, can fix tonight or up to you.
[17:05:55] <RhinosF1>	 (Attempt to fix)
[17:21:08] <wikibugs>	 (PS1) Andrew Bogott: eqiad1 keystone: move database to galera on cloudcontrol hosts [puppet] - https://gerrit.wikimedia.org/r/611935 (https://phabricator.wikimedia.org/T242455)
[17:22:02] <wikibugs>	 (CR) Andrew Bogott: [C: +2] eqiad1 keystone: move database to galera on cloudcontrol hosts [puppet] - https://gerrit.wikimedia.org/r/611935 (https://phabricator.wikimedia.org/T242455) (owner: Andrew Bogott)
[17:58:42] <wikibugs>	 (PS1) Tks4Fish: Add rollbacker to elwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/611937 (https://phabricator.wikimedia.org/T257745)
[18:06:28] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/611937 (https://phabricator.wikimedia.org/T257745) (owner: Tks4Fish)
[23:52:05] <icinga-wm>	 PROBLEM - Disk space on webperf1002 is CRITICAL: DISK CRITICAL - free space: /srv 11448 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops
[23:53:39] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:57:19] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops