[00:16:41] (PS1) Dave Pifke: arclamp: Remove residue from old renaming [puppet] - https://gerrit.wikimedia.org/r/611472
[00:33:42] (CR) Dzahn: [C: +2] arclamp: Remove residue from old renaming [puppet] - https://gerrit.wikimedia.org/r/611472 (owner: Dave Pifke)
[00:50:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:11] PROBLEM - MD RAID on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:15:41] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:15:41] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:27] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[01:17:23] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:20:13] Is the Toolforge down?
[01:20:52] I can still ssh - what makes you think it might be down?
[01:21:16] Because it won't load a tool from there?
[01:21:18] :P
[01:21:32] Must have been me then. It's working :OOOOOO
[01:22:18] There's a few tools running stuff that shouldn't be
[01:22:27] But not causing major load issues
[01:29:35] PROBLEM - DPKG on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[01:32:21] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:34:57] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:36:53] RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:47:21] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1004 is OK: OK: synced at Sat 2020-07-11 01:47:19 UTC. https://wikitech.wikimedia.org/wiki/NTP
[02:00:27] RECOVERY - DPKG on kubernetes1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[04:10:13] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:12:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:49:59] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 56 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:55:51] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200711T0700)
[07:17:43] Operations, serviceops, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) >>! In T252391#6297445, @Krinkle wrote: > CentralAuth and ChronologyProtector are both still high-profile consumers of main stash. Both are scheduled f...
[08:25:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:29:33] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:37:25] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 55 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:39:01] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:44:27] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[08:45:07] PROBLEM - DPKG on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:45:35] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:46:11] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:46:13] PROBLEM - configured eth on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[08:47:11] PROBLEM - dhclient process on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[08:49:07] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:54:33] PROBLEM - MD RAID on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:59:07] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:59:41] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:11] PROBLEM - Check size of conntrack table on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:00:33] PROBLEM - Disk space on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops
[09:02:57] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[09:03:01] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001 job=burrow partition={0,1} site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&v
[09:03:01] ad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[09:05:43] PROBLEM - DPKG on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:08:15] PROBLEM - MD RAID on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:08:21] PROBLEM - IPMI Sensor Status on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:08:35] PROBLEM - configured eth on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[09:14:58] RECOVERY - Check size of conntrack table on kubernetes2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:16:17] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:37] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:19:05] RECOVERY - MD RAID on kubernetes2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:23:23] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[09:32:37] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:35:49] PROBLEM - MD RAID on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:36:35] RECOVERY - DPKG on kubernetes2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:36:41] PROBLEM - Check size of conntrack table on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:36:43] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[09:37:57] PROBLEM - puppet last run on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:38:27] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton
[09:39:25] RECOVERY - configured eth on kubernetes2002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[09:40:21] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[09:43:35] PROBLEM - SSH on kubernetes2004 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:44:03] PROBLEM - dhclient process on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[09:44:33] PROBLEM - configured eth on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[09:45:58] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[09:47:11] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:47:17] RECOVERY - SSH on kubernetes2004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:47:47] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[09:49:39] RECOVERY - Check size of conntrack table on kubernetes2004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:49:41] RECOVERY - puppet last run on kubernetes2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:57:31] RECOVERY - MD RAID on kubernetes2004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:07:33] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes2004 is OK: OK: synced at Sat 2020-07-11 10:07:32 UTC. https://wikitech.wikimedia.org/wiki/NTP
[10:14:55] RECOVERY - dhclient process on kubernetes2004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[10:15:21] RECOVERY - configured eth on kubernetes2004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[10:18:01] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:46:18] PROBLEM - Long running screen/tmux on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[12:20:19] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32909312 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:20:21] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 41820568 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:22:11] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 78 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:22:11] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 36824 and 79 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:36:59] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:40:43] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:02:21] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:06:55] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:08:51] RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:11:25] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1004 is OK: OK: synced at Sat 2020-07-11 14:11:23 UTC. https://wikitech.wikimedia.org/wiki/NTP
[14:13:13] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops
[14:16:48] RECOVERY - IPMI Sensor Status on kubernetes1004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:24:19] RECOVERY - DPKG on kubernetes1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[14:25:19] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:25:21] RECOVERY - configured eth on kubernetes1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[14:26:25] RECOVERY - dhclient process on kubernetes1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[14:47:08] RECOVERY - Long running screen/tmux on kubernetes1004 is OK: OK: Tmux detected but not long running. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[15:14:20] (PS3) Urbanecm: Initial configuration for sysop_itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/609768 (https://phabricator.wikimedia.org/T256545)
[15:19:19] (CR) Urbanecm: [C: -1] "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/594214 (owner: Jforrester)
[16:27:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:28:47] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:11] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:21:39] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[18:21:51] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:21:55] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:21:55] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[18:21:55] PROBLEM - MD RAID on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[18:29:44] (PS1) QChris: Drop gerrit1002 from targets [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/611804
[18:29:46] (PS1) QChris: Bump gerrit.war to Gerrit 3.2.2-138-g230805407f [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/611805 (https://phabricator.wikimedia.org/T257186)
[18:29:49] (PS1) QChris: Bump zuul plugin to master-12-ge51d7e8 [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/611806 (https://phabricator.wikimedia.org/T257198)
[18:34:41] PROBLEM - DPKG on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[18:35:05] PROBLEM - Disk space on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops
[18:35:43] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:35:45] PROBLEM - configured eth on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[18:36:30] (CR) QChris: [V: +2 C: +2] Drop gerrit1002 from targets [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/611804 (owner: QChris)
[18:36:35] (CR) QChris: [V: +2 C: +2] Bump gerrit.war to Gerrit 3.2.2-138-g230805407f [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/611805 (https://phabricator.wikimedia.org/T257186) (owner: QChris)
[18:36:42] (CR) QChris: [V: +2 C: +2] Bump zuul plugin to master-12-ge51d7e8 [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/611806 (https://phabricator.wikimedia.org/T257198) (owner: QChris)
[18:36:51] PROBLEM - dhclient process on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[18:45:15] PROBLEM - IPMI Sensor Status on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:50:25] (CR) QChris: "The CI table machinery does not rely the prepended jobname. Agreed." [puppet] - https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: Hashar)
[18:55:35] !log qchris@deploy1001 Started deploy [gerrit/gerrit@a71a0df]: Gerrit to v3.2.2-138-g230805407f and zuul plugin to master-12-ge51d7e8 on gerrit2001
[18:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:45] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@a71a0df]: Gerrit to v3.2.2-138-g230805407f and zuul plugin to master-12-ge51d7e8 on gerrit2001 (duration: 00m 10s)
[18:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:35] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[19:00:49] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:49] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:05:17] RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:05:31] RECOVERY - DPKG on kubernetes1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[19:06:33] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:06:37] RECOVERY - configured eth on kubernetes1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[19:07:41] RECOVERY - dhclient process on kubernetes1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[19:08:38] !log Restarting Gerrit on gerrit2001 to switch to new gerrit.war and zuul plugin
[19:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:53] !log qchris@deploy1001 Started deploy [gerrit/gerrit@a71a0df]: Gerrit to v3.2.2-138-g230805407f and zuul plugin to master-12-ge51d7e8 on gerrit1001
[19:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:00] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@a71a0df]: Gerrit to v3.2.2-138-g230805407f and zuul plugin to master-12-ge51d7e8 on gerrit1001 (duration: 00m 07s)
[19:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:05] RECOVERY - IPMI Sensor Status on kubernetes1004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:16:36] !log Restarting Gerrit on gerrit1001 to switch to new gerrit.war and zuul plugin
[19:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:45] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops
[19:23:37] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1004 is OK: OK: synced at Sat 2020-07-11 19:23:35 UTC. https://wikitech.wikimedia.org/wiki/NTP
[19:39:43] I don't know who chose that format name
[19:39:55] who would want a gerrit war? :P
[20:19:05] Platonides: heh for a moment I thought there was such a threat
[20:26:09] :D
[20:27:02] I read above "new gerrit war"...
[21:18:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:19:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets