[00:16:41] (PS1) Dave Pifke: arclamp: Remove residue from old renaming [puppet] - https://gerrit.wikimedia.org/r/611472
[00:33:42] (CR) Dzahn: [C: +2] arclamp: Remove residue from old renaming [puppet] - https://gerrit.wikimedia.org/r/611472 (owner: Dave Pifke)
[00:50:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:11] PROBLEM - MD RAID on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:15:41] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:15:41] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:27] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[01:17:23] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:20:13] Is the Toolforge down?
[01:20:52] I can still ssh - what makes you think it might be down?
[01:21:16] Because it won't load a tool from there?
[01:21:18] :P
[01:21:32] Must have been me then. It's working :OOOOOO
[01:22:18] There's a few tools running stuff that shouldn't be
[01:22:27] But not causing major load issues
[01:29:35] PROBLEM - DPKG on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[01:32:21] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:34:57] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:36:53] RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:47:21] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1004 is OK: OK: synced at Sat 2020-07-11 01:47:19 UTC. https://wikitech.wikimedia.org/wiki/NTP
[02:00:27] RECOVERY - DPKG on kubernetes1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[04:10:13] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:12:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:49:59] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 56 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:55:51] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200711T0700)
[07:17:43] Operations, serviceops, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) >>! In T252391#6297445, @Krinkle wrote: > CentralAuth and ChronologyProtector are both still high-profile consumers of main stash. Both are scheduled f...
[08:25:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:29:33] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:37:25] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 55 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:39:01] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:44:27] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[08:45:07] PROBLEM - DPKG on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:45:35] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:46:11] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:46:13] PROBLEM - configured eth on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[08:47:11] PROBLEM - dhclient process on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[08:49:07] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:54:33] PROBLEM - MD RAID on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:59:07] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:59:41] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:11] PROBLEM - Check size of conntrack table on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:00:33] PROBLEM - Disk space on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops
[09:02:57] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[09:03:01] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001 job=burrow partition={0,1} site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&v
[09:03:01] ad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[09:05:43] PROBLEM - DPKG on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:08:15] PROBLEM - MD RAID on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:08:21] PROBLEM - IPMI Sensor Status on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[09:08:35] PROBLEM - configured eth on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[09:14:58] RECOVERY - Check size of conntrack table on kubernetes2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:16:17] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:37] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:19:05] RECOVERY - MD RAID on kubernetes2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:23:23] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[09:32:37] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:35:49] PROBLEM - MD RAID on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:36:35] RECOVERY - DPKG on kubernetes2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:36:41] PROBLEM - Check size of conntrack table on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:36:43] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[09:37:57] PROBLEM - puppet last run on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:38:27] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton
[09:39:25] RECOVERY - configured eth on kubernetes2002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[09:40:21] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[09:43:35] PROBLEM - SSH on kubernetes2004 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:44:03] PROBLEM - dhclient process on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[09:44:33] PROBLEM - configured eth on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[09:45:58] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[09:47:11] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2004 is CRITICAL: connect to address 10.192.48.71 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:47:17] RECOVERY - SSH on kubernetes2004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:47:47] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[09:49:39] RECOVERY - Check size of conntrack table on kubernetes2004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[09:49:41] RECOVERY - puppet last run on kubernetes2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:57:31] RECOVERY - MD RAID on kubernetes2004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:07:33] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes2004 is OK: OK: synced at Sat 2020-07-11 10:07:32 UTC. https://wikitech.wikimedia.org/wiki/NTP
[10:14:55] RECOVERY - dhclient process on kubernetes2004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[10:15:21] RECOVERY - configured eth on kubernetes2004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[10:18:01] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:46:18] PROBLEM - Long running screen/tmux on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[12:20:19] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32909312 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:20:21] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 41820568 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:22:11] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 78 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:22:11] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 36824 and 79 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:36:59] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:40:43] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:02:21] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:06:55] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:08:51] RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:11:25] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1004 is OK: OK: synced at Sat 2020-07-11 14:11:23 UTC. https://wikitech.wikimedia.org/wiki/NTP
[14:13:13] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops
[14:16:48] RECOVERY - IPMI Sensor Status on kubernetes1004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:24:19] RECOVERY - DPKG on kubernetes1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[14:25:19] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:25:21] RECOVERY - configured eth on kubernetes1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[14:26:25] RECOVERY - dhclient process on kubernetes1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[14:47:08] RECOVERY - Long running screen/tmux on kubernetes1004 is OK: OK: Tmux detected but not long running. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[15:14:20] (PS3) Urbanecm: Initial configuration for sysop_itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/609768 (https://phabricator.wikimedia.org/T256545)
[15:19:19] (CR) Urbanecm: [C: -1] "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/594214 (owner: Jforrester)
[16:27:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:28:47] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:11] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:21:39] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[18:21:51] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:21:55] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:21:55] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[18:21:55] PROBLEM - MD RAID on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[18:29:44] (PS1) QChris: Drop gerrit1002 from targets [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/611804
[18:29:46] (PS1) QChris: Bump gerrit.war to Gerrit 3.2.2-138-g230805407f [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/611805 (https://phabricator.wikimedia.org/T257186)
[18:29:49] (PS1) QChris: Bump zuul plugin to master-12-ge51d7e8 [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/611806 (https://phabricator.wikimedia.org/T257198)
[18:34:41] PROBLEM - DPKG on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[18:35:05] PROBLEM - Disk space on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops
[18:35:43] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:35:45] PROBLEM - configured eth on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[18:36:30] (CR) QChris: [V: +2 C: +2] Drop gerrit1002 from targets [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/611804 (owner: QChris)
[18:36:35] (CR) QChris: [V: +2 C: +2] Bump gerrit.war to Gerrit 3.2.2-138-g230805407f [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/611805 (https://phabricator.wikimedia.org/T257186) (owner: QChris)
[18:36:42] (CR) QChris: [V: +2 C: +2] Bump zuul plugin to master-12-ge51d7e8 [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/611806 (https://phabricator.wikimedia.org/T257198) (owner: QChris)
[18:36:51] PROBLEM - dhclient process on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[18:45:15] PROBLEM - IPMI Sensor Status on kubernetes1004 is CRITICAL: connect to address 10.64.48.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:50:25] (CR) QChris: "The CI table machinery does not rely the prepended jobname. Agreed." [puppet] - https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: Hashar)
[18:55:35] !log qchris@deploy1001 Started deploy [gerrit/gerrit@a71a0df]: Gerrit to v3.2.2-138-g230805407f and zuul plugin to master-12-ge51d7e8 on gerrit2001
[18:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:45] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@a71a0df]: Gerrit to v3.2.2-138-g230805407f and zuul plugin to master-12-ge51d7e8 on gerrit2001 (duration: 00m 10s)
[18:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:35] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[19:00:49] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:49] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:05:17] RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:05:31] RECOVERY - DPKG on kubernetes1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[19:06:33] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:06:37] RECOVERY - configured eth on kubernetes1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[19:07:41] RECOVERY - dhclient process on kubernetes1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[19:08:38] !log Restarting Gerrit on gerrit2001 to switch to new gerrit.war and zuul plugin
[19:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:53] !log qchris@deploy1001 Started deploy [gerrit/gerrit@a71a0df]: Gerrit to v3.2.2-138-g230805407f and zuul plugin to master-12-ge51d7e8 on gerrit1001
[19:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:00] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@a71a0df]: Gerrit to v3.2.2-138-g230805407f and zuul plugin to master-12-ge51d7e8 on gerrit1001 (duration: 00m 07s)
[19:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:05] RECOVERY - IPMI Sensor Status on kubernetes1004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:16:36] !log Restarting Gerrit on gerrit1001 to switch to new gerrit.war and zuul plugin
[19:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:45] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes1004&var-datasource=eqiad+prometheus/ops
[19:23:37] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes1004 is OK: OK: synced at Sat 2020-07-11 19:23:35 UTC. https://wikitech.wikimedia.org/wiki/NTP
[19:39:43] I don't know who chose that format name
[19:39:55] who would want a gerrit war? :P
[20:19:05] Platonides: heh for a moment I thought there was such a threat
[20:26:09] :D
[20:27:02] I read above "new gerrit war"...
[21:18:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:19:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets