[00:01:09] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:01:31] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:01:57] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 9 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:02:45] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:05:39] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:07:01] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:07:06] <icinga-wm>	 PROBLEM - MariaDB disk space #page on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[00:07:11] <icinga-wm>	 PROBLEM - dump of s7 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:07:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:07:35] <icinga-wm>	 PROBLEM - DPKG on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[00:07:39] <icinga-wm>	 PROBLEM - configured eth on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[00:07:41] <icinga-wm>	 PROBLEM - Check systemd state on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:41] <icinga-wm>	 PROBLEM - dhclient process on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[00:07:55] <icinga-wm>	 PROBLEM - Check size of conntrack table on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[00:08:19] <icinga-wm>	 PROBLEM - MD RAID on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:08:28] <icinga-wm>	 PROBLEM - mysqld processes #page on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[00:08:33] <icinga-wm>	 PROBLEM - Disk space on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1115&var-datasource=eqiad+prometheus/ops
[00:08:55] <icinga-wm>	 PROBLEM - dump of s8 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:09:11] <icinga-wm>	 PROBLEM - snapshot of s2 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:09:47] <icinga-wm>	 PROBLEM - snapshot of s8 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:10:29] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:10:49] <icinga-wm>	 PROBLEM - dump of s7 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:11:17] <icinga-wm>	 PROBLEM - puppet last run on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:11:27] <icinga-wm>	 PROBLEM - dump of m3 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:13:05] <icinga-wm>	 PROBLEM - dump of s4 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:13:05] <icinga-wm>	 PROBLEM - dump of s4 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:13:37] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:13:45] <icinga-wm>	 PROBLEM - dump of m1 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:13:53] <icinga-wm>	 PROBLEM - dump of m1 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:14:03] <icinga-wm>	 PROBLEM - dump of s5 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:14:09] <icinga-wm>	 PROBLEM - snapshot of s8 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:14:41] <icinga-wm>	 RECOVERY - MD RAID on db1115 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:14:50] <icinga-wm>	 RECOVERY - mysqld processes #page on db1115 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[00:14:53] <icinga-wm>	 RECOVERY - Disk space on db1115 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1115&var-datasource=eqiad+prometheus/ops
[00:14:59] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on db1115 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:15:04] <icinga-wm>	 RECOVERY - MariaDB disk space #page on db1115 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[00:15:33] <icinga-wm>	 RECOVERY - DPKG on db1115 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[00:15:37] <icinga-wm>	 RECOVERY - configured eth on db1115 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[00:15:37] <icinga-wm>	 RECOVERY - Check systemd state on db1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:39] <icinga-wm>	 RECOVERY - dhclient process on db1115 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[00:15:53] <icinga-wm>	 RECOVERY - Check size of conntrack table on db1115 is OK: OK: nf_conntrack is 5 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[00:16:51] <icinga-wm>	 RECOVERY - puppet last run on db1115 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:18:21] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:18:25] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:19:55] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:21:35] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:22:33] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:23:21] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:23:43] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:24:55] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:27:15] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:29:21] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:29:39] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:30:01] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:31:37] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:31:59] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:32:47] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:33:33] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 9 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:34:05] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:35:09] <herron>	 !log set icinga downtimes on flapping cr2-eqiad and cr2-codfw alerts until monday
[00:35:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:47] <icinga-wm>	 RECOVERY - dump of s7 in codfw on db1115 is OK: dump for s7 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:31:51 from db2100.codfw.wmnet:3317 (110 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:39:29] <icinga-wm>	 RECOVERY - dump of s8 in codfw on db1115 is OK: dump for s8 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 03:19:04 from db2100.codfw.wmnet:3318 (140 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:39:45] <icinga-wm>	 RECOVERY - snapshot of s2 in eqiad on db1115 is OK: snapshot for s2 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-08-23 03:24:09 from db1095.eqiad.wmnet:3312 (830 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:40:21] <icinga-wm>	 RECOVERY - snapshot of s8 in eqiad on db1115 is OK: snapshot for s8 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-08-22 21:22:35 from db1116.eqiad.wmnet:3318 (1507 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:41:23] <icinga-wm>	 RECOVERY - dump of s7 in eqiad on db1115 is OK: dump for s7 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:00:01 from db1116.eqiad.wmnet:3317 (110 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:42:01] <icinga-wm>	 RECOVERY - dump of m3 in codfw on db1115 is OK: dump for m3 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 01:36:55 from db2078.codfw.wmnet:3323 (41 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:43:39] <icinga-wm>	 RECOVERY - dump of s4 in codfw on db1115 is OK: dump for s4 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:00:02 from db2099.codfw.wmnet:3314 (117 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:43:39] <icinga-wm>	 RECOVERY - dump of s4 in eqiad on db1115 is OK: dump for s4 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:00:01 from db1102.eqiad.wmnet:3314 (117 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:44:17] <icinga-wm>	 RECOVERY - dump of m1 in codfw on db1115 is OK: dump for m1 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:00:01 from db2078.codfw.wmnet:3321 (13 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:44:27] <icinga-wm>	 RECOVERY - dump of m1 in eqiad on db1115 is OK: dump for m1 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 02:49:37 from db1117.eqiad.wmnet:3321 (13 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:44:37] <icinga-wm>	 RECOVERY - dump of s5 in codfw on db1115 is OK: dump for s5 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 04:42:18 from db2099.codfw.wmnet:3315 (99 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:44:41] <icinga-wm>	 RECOVERY - snapshot of s8 in codfw on db1115 is OK: snapshot for s8 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-08-22 20:50:22 from db2100.codfw.wmnet:3318 (1222 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[02:21:25] <icinga-wm>	 PROBLEM - snapshot of s5 in codfw on db1115 is CRITICAL: snapshot for s5 at codfw taken more than 4 days ago: Most recent backup 2019-08-21 01:58:55 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[02:22:21] <cdanis>	 !log clear downtimes on cr2-eqiad/cr2-codfw, link supposedly stable now
[02:22:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:35] <icinga-wm>	 PROBLEM - snapshot of s7 in codfw on db1115 is CRITICAL: snapshot for s7 at codfw taken more than 4 days ago: Most recent backup 2019-08-21 03:47:27 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[04:05:19] <icinga-wm>	 PROBLEM - snapshot of s6 in codfw on db1115 is CRITICAL: snapshot for s6 at codfw taken more than 4 days ago: Most recent backup 2019-08-21 03:34:39 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[09:02:48] <wikibugs>	 (PS1) Volans: CHANGELOG: add changelogs for release v0.0.27 [software/spicerack] - https://gerrit.wikimedia.org/r/532221
[09:08:49] <wikibugs>	 (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.27 [software/spicerack] - https://gerrit.wikimedia.org/r/532221 (owner: Volans)
[09:13:00] <wikibugs>	 (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.27 [software/spicerack] - https://gerrit.wikimedia.org/r/532221 (owner: Volans)
[09:14:02] <wikibugs>	 (CR) jenkins-bot: CHANGELOG: add changelogs for release v0.0.27 [software/spicerack] - https://gerrit.wikimedia.org/r/532221 (owner: Volans)
[09:16:12] <wikibugs>	 (PS1) Volans: Upstream release v0.0.27 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/532222
[09:20:52] <wikibugs>	 (PS1) Volans: setup.py: add missing PyYAML dependency [software/homer] - https://gerrit.wikimedia.org/r/532223 (https://phabricator.wikimedia.org/T228388)
[09:20:54] <wikibugs>	 (PS1) Volans: doc: add configuration example in documentation [software/homer] - https://gerrit.wikimedia.org/r/532224 (https://phabricator.wikimedia.org/T228388)
[09:20:56] <wikibugs>	 (PS1) Volans: Configuration: load and merge private config [software/homer] - https://gerrit.wikimedia.org/r/532225 (https://phabricator.wikimedia.org/T228388)
[09:20:58] <wikibugs>	 (PS1) Volans: devices: add query capability [software/homer] - https://gerrit.wikimedia.org/r/532226 (https://phabricator.wikimedia.org/T228388)
[09:21:00] <wikibugs>	 (PS1) Volans: cli: rename action compile to generate [software/homer] - https://gerrit.wikimedia.org/r/532227 (https://phabricator.wikimedia.org/T228388)
[09:21:12] <wikibugs>	 (CR) Volans: [C: +2] Upstream release v0.0.27 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/532222 (owner: Volans)
[09:25:19] <wikibugs>	 (Merged) jenkins-bot: Upstream release v0.0.27 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/532222 (owner: Volans)
[09:45:14] <wikibugs>	 Operations, Analytics, User-Elukey: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (Nuria) Open→Resolved
[10:57:05] <icinga-wm>	 PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:58:31] <icinga-wm>	 RECOVERY - Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:31:57] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[12:35:05] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[13:46:08] <volans>	 !log uploaded spicerack_0.0.27-1_amd64.deb to apt.wikimedia.org stretch-wikimedia
[13:46:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:25] <icinga-wm>	 PROBLEM - mysqld processes #page on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[14:48:45] <marostegui>	 checking
[14:48:53] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:49:13] <marostegui>	 Looks like OOM
[14:49:15] <icinga-wm>	 PROBLEM - dhclient process on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[14:49:17] <icinga-wm>	 PROBLEM - MD RAID on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:49:21] <icinga-wm>	 PROBLEM - Check size of conntrack table on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:49:25] <icinga-wm>	 PROBLEM - configured eth on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[14:49:26] <_joe_>	 Out of memory: Kill process 25675 (mysqld) score 380 or sacrifice child
[14:49:27] <icinga-wm>	 PROBLEM - Check systemd state on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:49:28] <_joe_>	 yes
[14:49:41] <icinga-wm>	 PROBLEM - DPKG on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[14:49:43] <icinga-wm>	 PROBLEM - Disk space on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1115&var-datasource=eqiad+prometheus/ops
[14:49:44] <_joe_>	 because ofc the oom killer will kill that
[14:49:47] <icinga-wm>	 PROBLEM - MariaDB disk space #page on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[14:50:00] <marostegui>	 again the same thing
[14:50:02] <marostegui>	 as last night
[14:50:11] <icinga-wm>	 PROBLEM - snapshot of s6 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[14:50:11] <icinga-wm>	 PROBLEM - dump of s2 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[14:50:18] <marostegui>	 but last night it didn't have any OOM
[14:50:25] <icinga-wm>	 PROBLEM - dump of m5 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[14:50:25] <icinga-wm>	 PROBLEM - snapshot of s2 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[14:51:06] <_joe_>	 right now it has 12 GB used
[14:51:36] <marostegui>	 mysql is now doing recovery
[14:51:50] <godog>	 I'm here btw, in case help is needed
[14:52:03] <icinga-wm>	 PROBLEM - dump of s3 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[14:52:33] <_joe_>	 the memory used by running processes is definitely too high on that server
[14:52:37] <icinga-wm>	 PROBLEM - dump of s1 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[14:52:42] <marostegui>	 going to downtime the host for now to avoid more pages
[14:52:54] <XioNoX>	 thx for working on a Sunday, still doing wedding stuff but with laptop and can be online if needed
[14:52:56] <_joe_>	 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1115&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&from=now-90d&to=now&refresh=5m&panelId=4&fullscreen
[14:53:07] <_joe_>	 same thing happened in june
[14:53:08] <marostegui>	 did someone open a ticket for the issue last night?
[14:53:17] <_joe_>	 marostegui: I doubt it
[14:53:19] <marostegui>	 _joe_: yeah, tendril suffers stuff like this from time to time
[14:53:31] <_joe_>	 oh that's the tendril db?
[14:53:39] <marostegui>	 yep
[14:53:41] <_joe_>	 can't we tune it a bit down?
[14:53:46] <marostegui>	 we already did
[14:53:53] <marostegui>	 ok, mysql is back up
[14:54:00] <marostegui>	 can someone confirm tendril or dbtree works?
[14:54:02] <marostegui>	 I am checking HW logs
[14:54:10] <_joe_>	 dbtree works
[14:54:18] <_joe_>	 marostegui: it was just a very clear case of oom
[14:54:27] <_joe_>	 we shall have some alert on servers swapping
[14:55:12] <marostegui>	 _joe_: but from yesterday it wasn't an OOM, mysql never went down
[14:55:18] <_joe_>	 trying a few things on tendril
[14:55:24] <marostegui>	 HW logs are clean
[14:55:25] <_joe_>	 marostegui: yeah the server was just swapping
[14:55:53] <marostegui>	 tendril seems to be working fine indeed
[14:55:56] <marostegui>	 I am going to create a task
[14:56:13] <marostegui>	 And we can all enjoy the Sunday and check tomorrow :)
[14:56:17] <_joe_>	 yeah
[14:56:35] <marostegui>	 Thanks for responding!
[14:56:40] <wikibugs>	 (CR) Ayounsi: [C: +1] cli: rename action compile to generate [software/homer] - https://gerrit.wikimedia.org/r/532227 (https://phabricator.wikimedia.org/T228388) (owner: Volans)
[14:56:42] <marostegui>	 Thanks also godog and XioNoX :*
[14:59:30] <volans>	 sorry I'm late I was on the phone and didn't notice the sms
[15:00:01] <marostegui>	 volans, thanks! already under control! go back to your sunday
[15:01:38] <marostegui>	 https://phabricator.wikimedia.org/T231165
[15:02:02] <volans>	 thx
[15:04:12] <marostegui>	 thanks guys! I am going to go off!
[15:14:45] <icinga-wm>	 RECOVERY - DPKG on db1115 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[15:14:47] <icinga-wm>	 RECOVERY - Disk space on db1115 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1115&var-datasource=eqiad+prometheus/ops
[15:14:51] <icinga-wm>	 RECOVERY - MariaDB disk space #page on db1115 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[15:15:07] <icinga-wm>	 RECOVERY - mysqld processes #page on db1115 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[15:15:35] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on db1115 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:15:57] <icinga-wm>	 RECOVERY - dhclient process on db1115 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[15:15:59] <icinga-wm>	 RECOVERY - MD RAID on db1115 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[15:16:05] <icinga-wm>	 RECOVERY - Check size of conntrack table on db1115 is OK: OK: nf_conntrack is 4 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[15:16:09] <icinga-wm>	 RECOVERY - configured eth on db1115 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[15:16:11] <icinga-wm>	 RECOVERY - Check systemd state on db1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:20:45] <icinga-wm>	 RECOVERY - dump of s2 in eqiad on db1115 is OK: dump for s2 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:05:26 from db1095.eqiad.wmnet:3312 (118 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[15:20:45] <icinga-wm>	 RECOVERY - snapshot of s6 in eqiad on db1115 is OK: snapshot for s6 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-08-23 07:14:01 from db1139.eqiad.wmnet:3316 (499 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[15:20:59] <icinga-wm>	 RECOVERY - dump of m5 in eqiad on db1115 is OK: dump for m5 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 02:43:56 from db1117.eqiad.wmnet:3325 (13 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[15:20:59] <icinga-wm>	 RECOVERY - snapshot of s2 in codfw on db1115 is OK: snapshot for s2 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-08-23 01:03:46 from db2098.codfw.wmnet:3312 (777 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[15:22:39] <icinga-wm>	 RECOVERY - dump of s3 in eqiad on db1115 is OK: dump for s3 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 03:34:36 from db1095.eqiad.wmnet:3313 (94 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[15:23:11] <icinga-wm>	 RECOVERY - dump of s1 in codfw on db1115 is OK: dump for s1 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:00:01 from db2097.codfw.wmnet:3311 (148 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[16:32:00] <wikibugs>	 (CR) Krinkle: CommonSettings: Clean up wmf-config caching code [no-op] (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/528446 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle)
[17:35:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[17:36:35] <icinga-wm>	 RECOVERY - HHVM rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 74552 bytes in 0.305 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:34:25] <wikibugs>	 (PS1) Krinkle: Avoid localised url computation for P3P headers from CentralAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/532268 (https://phabricator.wikimedia.org/T189966)
[21:00:39] <wikibugs>	 Operations, ops-codfw: Degraded RAID on db2035 - https://phabricator.wikimedia.org/T231176 (ops-monitoring-bot)
[22:11:19] <wikibugs>	 (PS1) DannyS712: General cleanup of initialise settings [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178)
[22:24:42] <wikibugs>	 (PS2) DannyS712: General cleanup of initialise settings [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178)
[22:25:41] <wikibugs>	 (CR) jerkins-bot: [V: -1] General cleanup of initialise settings [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712)
[22:30:35] <wikibugs>	 (CR) DannyS712: "Inline explanations provided for all non-whitespace changes; tests appear to be failing due to unrelated changes" (9 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712)
[22:35:04] <wikibugs>	 (CR) Krinkle: General cleanup of initialise settings (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712)