[00:04:19] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:15] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:29] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:29] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:35] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:03] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[03:49:43] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[04:06:15] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[04:26:23] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 64.07 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[06:29:05] PROBLEM - ores uWSGI web app on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:30:33] PROBLEM - Check systemd state on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:43] PROBLEM - Check size of conntrack table on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:31:13] PROBLEM - puppet last run on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:36:07] PROBLEM - MD RAID on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:36:11] PROBLEM - configured eth on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[06:42:41] PROBLEM - dhclient process on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[06:42:43] PROBLEM - ores_workers_running on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES
[06:43:51] PROBLEM - Check whether ferm is active by checking the default input chain on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:46:19] PROBLEM - Disk space on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1003&var-datasource=eqiad+prometheus/ops
[06:51:49] RECOVERY - ores_workers_running on ores1003 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[06:52:33] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:43] RECOVERY - Check size of conntrack table on ores1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:54:33] RECOVERY - puppet last run on ores1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:57:45] RECOVERY - MD RAID on ores1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200614T0700)
[07:06:59] RECOVERY - configured eth on ores1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[07:07:09] RECOVERY - Disk space on ores1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1003&var-datasource=eqiad+prometheus/ops
[07:13:29] RECOVERY - dhclient process on ores1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[07:14:41] RECOVERY - Check whether ferm is active by checking the default input chain on ores1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:37:59] So over the last day or two I've found myself logged out of phab, gerrit, and wikitech randomly - did something happen?
[07:54:49] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on cumin1001.eqiad.wmnet for hosts: ` ['kubernetes...
[08:35:31] (PS1) Alexandros Kosiaris: install: Switch kubernetes2004 to stretch [puppet] - https://gerrit.wikimedia.org/r/605400 (https://phabricator.wikimedia.org/T252185)
[08:42:51] (PS2) Alexandros Kosiaris: install: Switch kubernetes2014 to stretch [puppet] - https://gerrit.wikimedia.org/r/605400 (https://phabricator.wikimedia.org/T252185)
[08:46:02] Operations, ops-codfw, serviceops, Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (akosiaris) kubernetes2007 has been reimaged successfully, it seems like kubernetes2008 to kubernete...
[10:38:09] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[11:31:11] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 52.88 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[12:14:37] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:15:13] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:18:39] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22781304 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:22:21] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 47424 and 76 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:51:51] !log Disabling puppet on gerrit1002 (test instance) to do some more upgrade testing
[13:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:46] Operations, Traffic: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (CDanis)
[14:12:40] Operations, Traffic: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams - https://phabricator.wikimedia.org/T255368 (CDanis) p:Triage→Medium
[17:20:03] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[17:21:47] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[18:24:41] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 116.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[18:41:21] hey Amir1, pm me if you're around and have energy/time
[18:50:26] Operations, ops-codfw, procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (Papaul) @wiki_willy Please see below for the purchase information on the vertical duct used in C3. {F31865410}
[19:16:07] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 52.88 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[20:09:29] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 107.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[21:00:45] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 25.42 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[22:34:29] (CR) Andrew Bogott: "> The profile::openstack::eqiad1::galera::monitoring would have to be" [puppet] - https://gerrit.wikimedia.org/r/605315 (owner: Andrew Bogott)