[00:01:09] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:01:31] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:01:57] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 9 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:02:45] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:05:39] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:07:01] PROBLEM - Check whether ferm is active by checking the default input chain on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:07:06] PROBLEM - MariaDB disk space #page on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [00:07:11] PROBLEM - dump of s7 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:07:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:07:35] PROBLEM - DPKG on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [00:07:39] PROBLEM - configured eth on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [00:07:41] PROBLEM - Check systemd state on db1115 is 
CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:41] PROBLEM - dhclient process on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [00:07:55] PROBLEM - Check size of conntrack table on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [00:08:19] PROBLEM - MD RAID on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:08:28] PROBLEM - mysqld processes #page on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [00:08:33] PROBLEM - Disk space on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1115&var-datasource=eqiad+prometheus/ops [00:08:55] PROBLEM - dump of s8 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:09:11] PROBLEM - snapshot of s2 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:09:47] PROBLEM - snapshot of s8 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:10:29] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[00:10:49] PROBLEM - dump of s7 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:11:17] PROBLEM - puppet last run on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:11:27] PROBLEM - dump of m3 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:13:05] PROBLEM - dump of s4 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:13:05] PROBLEM - dump of s4 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:13:37] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:13:45] PROBLEM - dump of m1 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:13:53] PROBLEM - dump of m1 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:14:03] PROBLEM - dump of s5 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:14:09] PROBLEM - snapshot of s8 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:14:41] RECOVERY - MD RAID on db1115 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:14:50] RECOVERY - 
mysqld processes #page on db1115 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [00:14:53] RECOVERY - Disk space on db1115 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1115&var-datasource=eqiad+prometheus/ops [00:14:59] RECOVERY - Check whether ferm is active by checking the default input chain on db1115 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:15:04] RECOVERY - MariaDB disk space #page on db1115 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [00:15:33] RECOVERY - DPKG on db1115 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [00:15:37] RECOVERY - configured eth on db1115 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [00:15:37] RECOVERY - Check systemd state on db1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:39] RECOVERY - dhclient process on db1115 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [00:15:53] RECOVERY - Check size of conntrack table on db1115 is OK: OK: nf_conntrack is 5 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [00:16:51] RECOVERY - puppet last run on db1115 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:18:21] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:18:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:19:55] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:21:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:22:33] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:23:21] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:23:43] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:24:55] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:27:15] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:29:21] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:29:39] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:30:01] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:31:37] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:31:59] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:32:47] RECOVERY - 
OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:33:33] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 9 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:34:05] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:35:09] !log set icinga downtimes on flapping cr2-eqiad and cr2-codfw alerts until monday [00:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:47] RECOVERY - dump of s7 in codfw on db1115 is OK: dump for s7 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:31:51 from db2100.codfw.wmnet:3317 (110 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:39:29] RECOVERY - dump of s8 in codfw on db1115 is OK: dump for s8 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 03:19:04 from db2100.codfw.wmnet:3318 (140 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:39:45] RECOVERY - snapshot of s2 in eqiad on db1115 is OK: snapshot for s2 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-08-23 03:24:09 from db1095.eqiad.wmnet:3312 (830 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:40:21] RECOVERY - snapshot of s8 in eqiad on db1115 is OK: snapshot for s8 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-08-22 21:22:35 from db1116.eqiad.wmnet:3318 (1507 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:41:23] RECOVERY - dump of s7 in eqiad on db1115 is OK: dump for s7 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:00:01 from db1116.eqiad.wmnet:3317 (110 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:42:01] RECOVERY - dump of m3 in codfw on db1115 is OK: dump for m3 at codfw taken less than 8 days ago and larger 
than 10 GB: Last one 2019-08-20 01:36:55 from db2078.codfw.wmnet:3323 (41 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:43:39] RECOVERY - dump of s4 in codfw on db1115 is OK: dump for s4 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:00:02 from db2099.codfw.wmnet:3314 (117 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:43:39] RECOVERY - dump of s4 in eqiad on db1115 is OK: dump for s4 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:00:01 from db1102.eqiad.wmnet:3314 (117 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:44:17] RECOVERY - dump of m1 in codfw on db1115 is OK: dump for m1 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:00:01 from db2078.codfw.wmnet:3321 (13 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:44:27] RECOVERY - dump of m1 in eqiad on db1115 is OK: dump for m1 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 02:49:37 from db1117.eqiad.wmnet:3321 (13 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:44:37] RECOVERY - dump of s5 in codfw on db1115 is OK: dump for s5 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 04:42:18 from db2099.codfw.wmnet:3315 (99 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [00:44:41] RECOVERY - snapshot of s8 in codfw on db1115 is OK: snapshot for s8 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-08-22 20:50:22 from db2100.codfw.wmnet:3318 (1222 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [02:21:25] PROBLEM - snapshot of s5 in codfw on db1115 is CRITICAL: snapshot for s5 at codfw taken more than 4 days ago: Most recent backup 2019-08-21 01:58:55 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [02:22:21] !log clear downtimes on cr2-eqiad/cr2-codfw, link supposedly stable now [02:22:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:54:35] PROBLEM - snapshot of s7 in codfw on db1115 is CRITICAL: snapshot for s7 at codfw taken more than 4 days ago: Most recent backup 2019-08-21 03:47:27 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [04:05:19] PROBLEM - snapshot of s6 in codfw on db1115 is CRITICAL: snapshot for s6 at codfw taken more than 4 days ago: Most recent backup 2019-08-21 03:34:39 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [09:02:48] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.27 [software/spicerack] - https://gerrit.wikimedia.org/r/532221 [09:08:49] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.27 [software/spicerack] - https://gerrit.wikimedia.org/r/532221 (owner: Volans) [09:13:00] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.27 [software/spicerack] - https://gerrit.wikimedia.org/r/532221 (owner: Volans) [09:14:02] (CR) jenkins-bot: CHANGELOG: add changelogs for release v0.0.27 [software/spicerack] - https://gerrit.wikimedia.org/r/532221 (owner: Volans) [09:16:12] (PS1) Volans: Upstream release v0.0.27 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/532222 [09:20:52] (PS1) Volans: setup.py: add missing PyYAML dependency [software/homer] - https://gerrit.wikimedia.org/r/532223 (https://phabricator.wikimedia.org/T228388) [09:20:54] (PS1) Volans: doc: add configuration example in documentation [software/homer] - https://gerrit.wikimedia.org/r/532224 (https://phabricator.wikimedia.org/T228388) [09:20:56] (PS1) Volans: Configuration: load and merge private config [software/homer] - https://gerrit.wikimedia.org/r/532225 (https://phabricator.wikimedia.org/T228388) [09:20:58] (PS1) Volans: devices: add query capability [software/homer] - https://gerrit.wikimedia.org/r/532226 (https://phabricator.wikimedia.org/T228388) [09:21:00] (PS1) Volans: cli: rename action compile to generate
[software/homer] - https://gerrit.wikimedia.org/r/532227 (https://phabricator.wikimedia.org/T228388) [09:21:12] (CR) Volans: [C: +2] Upstream release v0.0.27 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/532222 (owner: Volans) [09:25:19] (Merged) jenkins-bot: Upstream release v0.0.27 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/532222 (owner: Volans) [09:45:14] Operations, Analytics, User-Elukey: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (Nuria) Open→Resolved [10:57:05] PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:58:31] RECOVERY - Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:31:57] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [12:35:05] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [13:46:08] !log uploaded spicerack_0.0.27-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [13:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:25] PROBLEM - mysqld processes #page on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:48:45] checking [14:48:53] PROBLEM - Check whether ferm is active by checking the default input chain on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:49:13] Looks like OOM [14:49:15] PROBLEM - dhclient process on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [14:49:17] PROBLEM - MD RAID on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:49:21] PROBLEM - Check size of conntrack table on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [14:49:25] PROBLEM - configured eth on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:49:26] <_joe_> Out of memory: Kill process 25675 (mysqld) score 380 or sacrifice child [14:49:27] PROBLEM - Check systemd state on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:28] <_joe_> 
yes [14:49:41] PROBLEM - DPKG on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:49:43] PROBLEM - Disk space on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1115&var-datasource=eqiad+prometheus/ops [14:49:44] <_joe_> because ofc the oom killer will kill that [14:49:47] PROBLEM - MariaDB disk space #page on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:50:00] again the same thing [14:50:02] that last night [14:50:11] PROBLEM - snapshot of s6 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [14:50:11] PROBLEM - dump of s2 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [14:50:18] but last night it didn't have any OOM [14:50:25] PROBLEM - dump of m5 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [14:50:25] PROBLEM - snapshot of s2 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [14:51:06] <_joe_> right now it has 12 GB used [14:51:36] mysql is now doing recovery [14:51:50] I'm here btw, in case help is needed [14:52:03] PROBLEM - dump of s3 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [14:52:33] <_joe_> the memory used by running processes is definitely too high on that server [14:52:37] PROBLEM - dump of s1 in codfw on db1115 is CRITICAL: 
connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups [14:52:42] going to downtime the host for now to avoid more pages [14:52:54] thx for working on a Sunday, still doing wedding stuff but with laptop and can be online if needed [14:52:56] <_joe_> https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1115&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&from=now-90d&to=now&refresh=5m&panelId=4&fullscreen [14:53:07] <_joe_> same thing happened in june [14:53:08] did someone opened a ticket for the issue last night? [14:53:17] <_joe_> marostegui: I doubt it [14:53:19] _joe_: yeah, tendril suffers stuff like this from time to time [14:53:31] <_joe_> oh that's the tendril db? [14:53:39] yep [14:53:41] <_joe_> can't we tune it a bit down? [14:53:46] we already did [14:53:53] ok, mysql is back up [14:54:00] can someone confirm tendril or dbtree works? [14:54:02] I am checking HW logs [14:54:10] <_joe_> dbtree works [14:54:18] <_joe_> marostegui: it was just a very clear case of oom [14:54:27] <_joe_> we shall have some alert on servers swapping [14:55:12] _joe_: but from yesterday it wasn't an OOM, mysql never went down [14:55:18] <_joe_> trying a few things on tendril [14:55:24] HW logs are clean [14:55:25] <_joe_> marostegui: yeah the server was just swapping [14:55:53] tendril seems to be working fine indeed [14:55:56] I am going to create a task [14:56:13] And we can all enjoy the Sunday and check tomorrow :) [14:56:17] <_joe_> yeah [14:56:35] Thanks for responding! [14:56:40] (CR) Ayounsi: [C: +1] cli: rename action compile to generate [software/homer] - https://gerrit.wikimedia.org/r/532227 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [14:56:42] Thanks also godog and XioNoX :* [14:59:30] sorry I'm late I was on the phone and didn't notice the sms [15:00:01] volans, thanks! already under control!
go back to your sunday [15:01:38] https://phabricator.wikimedia.org/T231165 [15:02:02] thx [15:04:12] thanks guys! I am going to go off! [15:14:45] RECOVERY - DPKG on db1115 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:14:47] RECOVERY - Disk space on db1115 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1115&var-datasource=eqiad+prometheus/ops [15:14:51] RECOVERY - MariaDB disk space #page on db1115 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:15:07] RECOVERY - mysqld processes #page on db1115 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:15:35] RECOVERY - Check whether ferm is active by checking the default input chain on db1115 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:15:57] RECOVERY - dhclient process on db1115 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:15:59] RECOVERY - MD RAID on db1115 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:16:05] RECOVERY - Check size of conntrack table on db1115 is OK: OK: nf_conntrack is 4 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:16:09] RECOVERY - configured eth on db1115 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:16:11] RECOVERY - Check systemd state on db1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:45] RECOVERY - dump of s2 in eqiad on db1115 is OK: dump for s2 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:05:26 from 
db1095.eqiad.wmnet:3312 (118 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [15:20:45] RECOVERY - snapshot of s6 in eqiad on db1115 is OK: snapshot for s6 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-08-23 07:14:01 from db1139.eqiad.wmnet:3316 (499 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [15:20:59] RECOVERY - dump of m5 in eqiad on db1115 is OK: dump for m5 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 02:43:56 from db1117.eqiad.wmnet:3325 (13 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [15:20:59] RECOVERY - snapshot of s2 in codfw on db1115 is OK: snapshot for s2 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-08-23 01:03:46 from db2098.codfw.wmnet:3312 (777 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [15:22:39] RECOVERY - dump of s3 in eqiad on db1115 is OK: dump for s3 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 03:34:36 from db1095.eqiad.wmnet:3313 (94 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [15:23:11] RECOVERY - dump of s1 in codfw on db1115 is OK: dump for s1 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2019-08-20 00:00:01 from db2097.codfw.wmnet:3311 (148 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [16:32:00] (CR) Krinkle: CommonSettings: Clean up wmf-config caching code [no-op] (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/528446 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle) [17:35:09] PROBLEM - HHVM rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [17:36:35] RECOVERY - HHVM rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 74552 bytes in 0.305 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:34:25] (PS1) Krinkle: Avoid localised url computation for P3P headers from CentralAuth
[mediawiki-config] - https://gerrit.wikimedia.org/r/532268 (https://phabricator.wikimedia.org/T189966) [21:00:39] Operations, ops-codfw: Degraded RAID on db2035 - https://phabricator.wikimedia.org/T231176 (ops-monitoring-bot) [22:11:19] (PS1) DannyS712: General cleanup of initialise settings [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) [22:24:42] (PS2) DannyS712: General cleanup of initialise settings [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) [22:25:41] (CR) jerkins-bot: [V: -1] General cleanup of initialise settings [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712) [22:30:35] (CR) DannyS712: "Inline explanations provided for all non-whitespace changes; tests appear to be failing due to unrelated changes" (9 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712) [22:35:04] (CR) Krinkle: General cleanup of initialise settings (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712)