[00:03:59] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:25] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:27] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:07] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:21] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:28] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [01:00:07] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [01:00:27] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:00:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:01:13] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [01:02:45] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:03:11] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:06:45] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [01:06:53] PROBLEM - MD RAID on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [01:19:27] PROBLEM - DPKG on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [01:20:27] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [01:20:29] PROBLEM - configured eth on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [01:21:35] PROBLEM - dhclient process on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [01:22:51] PROBLEM - Disk space on 
kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops [01:31:47] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [01:32:07] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:32:27] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:37:37] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1004 is OK: OK: synced at Sun 2020-07-12 01:37:37 UTC. https://wikitech.wikimedia.org/wiki/NTP [01:39:27] RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [01:43:43] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops [01:50:21] RECOVERY - DPKG on kubernetes1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [01:51:21] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [01:51:23] RECOVERY - configured eth on kubernetes1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [01:52:29] RECOVERY - dhclient process on kubernetes1004 is OK: PROCS 
OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [02:00:05] PROBLEM - LibreNMS has a critical alert #page on icinga1001 is CRITICAL: Primary outbound port utilisation over 80% #page (asw2-esams.mgmt.esams.wmnet) https://docs.google.com/document/d/1SeXdegjsfL94R6XYB1I4Uv8yjCPH1tVXeL0taJF0NNs/preview%23heading=h.qkfum7lgbdo5 [02:01:40] rzl: looks like monitoring glitch [02:01:57] dang, sure does [02:01:58] RECOVERY - LibreNMS has a critical alert #page on icinga1001 is OK: OK: zero critical LibreNMS alerts https://docs.google.com/document/d/1SeXdegjsfL94R6XYB1I4Uv8yjCPH1tVXeL0taJF0NNs/preview%23heading=h.qkfum7lgbdo5 [02:02:23] sigh [02:02:29] https://phabricator.wikimedia.org/T252630 [02:02:33] yeah agreed, I think we saw this once before [02:02:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:02:43] ah as usual, beaten to the punch by arzhel [02:03:41] we could add a condition to the librenms check so it needs to be < 101% or something like that as a workaround [02:04:17] Operations, netops, observability: LibreNMS monitoring glitch caused paging - https://phabricator.wikimedia.org/T252630 (CDanis) This happened again just now. Something else we could do as a temporary mitigation is just accept a longer time-to-page in legitimate incidents and increase retries to req... [02:04:29] oh, that would be good too [02:05:47] yeah longer time to page too, saw your comment [02:05:59] back to sleep! [02:06:15] o7 have a nice rest of your respective evenings [02:07:36] o/ [02:08:09] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:11:02] gaaaaaaaaaaaaaaah I just noticed the notes_url [03:57:11] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:55:13] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:58:55] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:25:25] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:26:15] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:28:05] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:29:07] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:35:07] PROBLEM - 
Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:36:09] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:38:01] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:38:51] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:42:39] (CR) Legoktm: [C: +1] VisualEditor: Explicitly set visualeditor-enable to 0 when non-default [mediawiki-config] - https://gerrit.wikimedia.org/r/610156 (https://phabricator.wikimedia.org/T248343) (owner: C. Scott Ananian) [06:44:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:46:27] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200712T0700) [07:05:49] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:11:41] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:59:57] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:05:45] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 46 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:04:17] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [09:06:17] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:06:35] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:57] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: connect 
to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:06:59] PROBLEM - configured eth on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [09:07:55] PROBLEM - MD RAID on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:08:03] PROBLEM - dhclient process on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [09:24:03] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [09:32:09] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [09:32:35] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:33] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:37:47] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:37:47] RECOVERY - configured eth on kubernetes1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [09:38:53] RECOVERY - dhclient process on kubernetes1004 is OK: PROCS OK: 0 processes with command name dhclient 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [09:40:27] RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:54:53] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1004 is OK: OK: synced at Sun 2020-07-12 09:54:52 UTC. https://wikitech.wikimedia.org/wiki/NTP [10:12:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:12:27] Operations, Wikimedia-Mailing-lists: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman - https://phabricator.wikimedia.org/T240929 (Peachey88) >>! In T240929#5763716, @jcrespo wrote: > I am personally not familiar with mailman format. Maybe @Herron, our mail expert, knows... [10:21:31] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:21:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:25:15] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:27:11] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:32:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:36:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:38:33] (CR) Jforrester: "I'll deploy this on Monday, if I remember."
[mediawiki-config] - https://gerrit.wikimedia.org/r/594214 (owner: Jforrester) [11:58:19] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:27] PROBLEM - Check size of conntrack table on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [12:01:15] PROBLEM - configured eth on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [12:02:03] PROBLEM - puppet last run on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:02:19] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:02:41] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [12:03:07] PROBLEM - MD RAID on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:03:27] PROBLEM - Check size of conntrack table on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [12:03:31] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:49] PROBLEM - Check whether ferm is active by checking the
default input chain on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:07:53] PROBLEM - MD RAID on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:10:33] PROBLEM - dhclient process on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [12:10:47] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:13:03] PROBLEM - Disk space on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes2002&var-datasource=codfw+prometheus/ops [12:13:23] PROBLEM - DPKG on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:15:05] RECOVERY - Check size of conntrack table on kubernetes2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [12:16:47] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:15] RECOVERY - Check size of conntrack table on kubernetes2004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [12:18:43] RECOVERY - MD RAID on kubernetes2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:19:35] RECOVERY - puppet last run on kubernetes2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:19:51] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:20:09] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:47] RECOVERY - MD RAID on kubernetes2004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:32:05] RECOVERY - configured eth on kubernetes2004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [12:33:33] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes2002 is OK: OK: synced at Sun 2020-07-12 12:33:31 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [12:33:55] RECOVERY - Disk space on kubernetes2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes2002&var-datasource=codfw+prometheus/ops [12:34:41] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:41:23] RECOVERY - dhclient process on kubernetes2002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [12:41:37] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:44:13] RECOVERY - DPKG on kubernetes2004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:10:00] Operations, Analytics-Radar, Gerrit: upgrade git-review to >= 1.27 on all stretch hosts across the board - https://phabricator.wikimedia.org/T257609 (QChris) [15:37:13] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [15:38:59] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [16:18:52] (CR) Krinkle: [C: +1] zuul: stop prefixing report with the job name [puppet] - https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: Hashar) [16:45:43] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:49:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:53:03] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:55:48] (PS1) Urbanecm: Initial configuration for arywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/611934 (https://phabricator.wikimedia.org/T257674) [16:58:21] (PS3) Urbanecm: Initial configuration for lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: RhinosF1) [16:58:37] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:58:42] Thanks Urbanecm [16:58:45] np [16:59:10] (CR) jerkins-bot: [V: -1] Initial configuration for lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: RhinosF1) [16:59:42] Urbanecm: ^ [16:59:48] I see that ;) [17:00:51] (PS4) Urbanecm: Initial configuration for lijwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: RhinosF1) [17:04:09] (CR) Urbanecm: [C: -1] Initial configuration for lijwikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/611435 (https://phabricator.wikimedia.org/T257672) (owner: RhinosF1) [17:05:47] Urbanecm: I see, can fix tonight or up to you. [17:05:55] (Attempt to fix) [17:21:08] (PS1) Andrew Bogott: eqiad1 keystone: move database to galera on cloudcontrol hosts [puppet] - https://gerrit.wikimedia.org/r/611935 (https://phabricator.wikimedia.org/T242455) [17:22:02] (CR) Andrew Bogott: [C: +2] eqiad1 keystone: move database to galera on cloudcontrol hosts [puppet] - https://gerrit.wikimedia.org/r/611935 (https://phabricator.wikimedia.org/T242455) (owner: Andrew Bogott) [17:58:42] (PS1) Tks4Fish: Add rollbacker to elwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/611937 (https://phabricator.wikimedia.org/T257745) [18:06:28] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/611937 (https://phabricator.wikimedia.org/T257745) (owner: Tks4Fish) [23:52:05] PROBLEM - Disk space on webperf1002 is CRITICAL: DISK CRITICAL - free space: /srv 11448 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops [23:53:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:57:19] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops