[00:12:19] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[00:14:35] PROBLEM - Disk space on analytics1039 is CRITICAL: DISK CRITICAL - free space: / 1536 MB (2% inode=97%)
[00:29:15] PROBLEM - Disk space on analytics1039 is CRITICAL: DISK CRITICAL - free space: / 467 MB (0% inode=97%)
[00:42:47] PROBLEM - Hadoop NodeManager on analytics1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[00:59:05] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CRITICAL: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: LOST
[00:59:53] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:12:23] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[01:47:29] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:48:27] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.038 second response time
[02:24:03] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Unable to delete certain files due to "inconsistent state within the internal storage backends" - https://phabricator.wikimedia.org/T141704 (10Fastily) When I try to delete [[ https://en.wikipedia.org/wiki/File:Dr._Richard_Pi...
[02:31:39] PROBLEM - Host mw1272 is DOWN: PING CRITICAL - Packet loss = 100%
[02:32:27] RECOVERY - Host mw1272 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[03:36:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 930.96 seconds
[03:37:01] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:46:26] !log Starting equalization translations for Serbian Cyrillic (sr-ec) and Serbian Latin (sr-el). Possible lags on translatewiki.net due to excessive number of operations
[03:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:46:38] (oops.. wrong channel)
[04:02:43] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:06:21] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 186.00 seconds
[04:48:41] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[04:49:15] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[04:49:17] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[04:49:25] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[04:49:29] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[04:49:37] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[04:49:39] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[05:07:19] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[05:09:01] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[05:09:35] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient
[05:09:37] RECOVERY - Disk space on notebook1004 is OK: DISK OK
[05:09:45] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational
[05:09:47] RECOVERY - DPKG on notebook1004 is OK: All packages OK
[05:09:57] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up
[05:10:07] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[05:37:27] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1004 is OK: OK: synced at Sun 2018-11-18 05:37:26 UTC.
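The repeated "connect to address 10.64.36.107 port 5666: Connection refused" alerts on notebook1004 above all come from the same underlying step: before Icinga can run any service check on a host, it has to open a TCP connection to the NRPE agent listening on port 5666, and a refused connection is reported as CRITICAL for every service that depends on that agent. The sketch below is only an illustration of that connection phase, not the real check_nrpe plugin; the host and port are taken from the alerts above.

```python
#!/usr/bin/env python3
"""Minimal sketch of the TCP-connect phase of an NRPE-style check.

Not the real check_nrpe plugin: it only demonstrates why a down or
firewalled agent yields "Connection refused" CRITICALs. Host and port
are taken from the notebook1004 alerts above.
"""
import socket
import sys

NRPE_PORT = 5666          # standard NRPE listening port
HOST = "10.64.36.107"     # notebook1004, from the alerts above


def check_agent(host: str, port: int = NRPE_PORT, timeout: float = 10.0) -> int:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print(f"OK - NRPE agent on {host}:{port} is reachable")
            return 0      # Nagios/Icinga exit code: OK
    except ConnectionRefusedError:
        print(f"connect to address {host} port {port}: Connection refused")
        return 2          # CRITICAL
    except (socket.timeout, OSError) as exc:
        print(f"CHECK_NRPE: Error - Could not connect to {host}: {exc}")
        return 2          # CRITICAL


if __name__ == "__main__":
    sys.exit(check_agent(HOST))
```

Because every service check on the host goes through the same agent, a single unreachable NRPE daemon fans out into one CRITICAL per service, which is why notebook1004 produces a burst of seven or eight alerts each time it flaps.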
[05:55:59] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[05:56:01] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[05:56:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:57:01] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[05:57:01] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[06:08:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:24:09] PROBLEM - Check size of conntrack table on elastic2021 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.33: Connection reset by peer
[06:24:15] PROBLEM - DPKG on elastic2021 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.33: Connection reset by peer
[06:25:59] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:26:03] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:26:13] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:26:27] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:26:57] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:26:58] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:27:19] PROBLEM - Check systemd state on elastic2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:27:29] RECOVERY - Check size of conntrack table on elastic2021 is OK: OK: nf_conntrack is 0 % full
[06:27:29] PROBLEM - MD RAID on elastic2021 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0
[06:27:36] 10Operations, 10ops-codfw: Degraded RAID on elastic2021 - https://phabricator.wikimedia.org/T209779 (10ops-monitoring-bot)
[06:27:37] RECOVERY - DPKG on elastic2021 is OK: All packages OK
[06:28:13] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[06:28:31] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/mark]
[06:29:21] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:39:25] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient
[06:39:25] RECOVERY - Disk space on notebook1004 is OK: DISK OK
[06:39:35] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:39:35] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational
[06:39:39] RECOVERY - DPKG on notebook1004 is OK: All packages OK
[06:39:49] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up
[06:40:03] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[06:45:13] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:58:55] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:59:13] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:12:21] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[08:32:41] 10Operations, 10Citoid, 10Patch-For-Review, 10Service-deployment-requests, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10Mvolz) >>! In T201611#4755783, @akosiaris wrote: > This has now been deployed to the kubernetes staging cluster. > > ` > akosiaris@deploy100...
[08:37:13] RECOVERY - Hadoop NodeManager on analytics1039 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[08:37:50] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[08:40:29] RECOVERY - Disk space on analytics1039 is OK: DISK OK
[08:47:50] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) Now in the better connected city, 15 km. away, using https://en.wikipedia.org/wiki/FarEasTone I get {F27249934} No I don't know why...
[09:00:12] !log cleaned up analytics1039 and restarted Yarn
[09:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:55] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
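The "MD RAID" alerts above report the same fields in both states: elastic2021 went CRITICAL with "State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0" (and an ops-monitoring-bot task was filed automatically), while notebook1004 recovers with matching counters. These are the device counters that `mdadm --detail` exposes for a Linux software RAID array. The sketch below shows one plausible way such a summary could be produced; it is an illustration under that assumption, not the plugin Icinga actually runs, and the array path is a hypothetical example.

```python
#!/usr/bin/env python3
"""Illustrative sketch: summarise a Linux software RAID array in the same
shape as the "MD RAID" alerts above (State / Active / Working / Failed /
Spare). Not the plugin Icinga runs; it parses `mdadm --detail`, whose
output contains lines such as "State : clean" and "Active Devices : 6".
The array path /dev/md0 is an example.
"""
import re
import subprocess
import sys


def md_summary(array: str = "/dev/md0") -> tuple[int, str]:
    out = subprocess.run(
        ["mdadm", "--detail", array], capture_output=True, text=True, check=True
    ).stdout
    fields = {}
    for key in ("State", "Active Devices", "Working Devices",
                "Failed Devices", "Spare Devices"):
        m = re.search(rf"^\s*{key}\s*:\s*(.+)$", out, re.MULTILINE)
        fields[key] = m.group(1).strip() if m else "unknown"

    summary = (f"State: {fields['State']}, Active: {fields['Active Devices']}, "
               f"Working: {fields['Working Devices']}, "
               f"Failed: {fields['Failed Devices']}, "
               f"Spare: {fields['Spare Devices']}")
    degraded = ("degraded" in fields["State"]
                or fields["Failed Devices"] not in ("0", "unknown"))
    return (2, f"CRITICAL: {summary}") if degraded else (0, f"OK: {summary}")


if __name__ == "__main__":
    code, message = md_summary()
    print(message)
    sys.exit(code)
```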
[09:34:43] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[09:40:17] 10Operations, 10Parsoid: parsoid-rt.service keeps failing on ruthenium causing alerts in icinga - https://phabricator.wikimedia.org/T209781 (10elukey)
[09:40:44] 10Operations, 10Parsoid: parsoid-rt.service keeps failing on ruthenium causing alerts in icinga - https://phabricator.wikimedia.org/T209781 (10elukey)
[10:10:23] <_joe_> elukey: your ticket is a duplicate
[10:12:13] 10Operations, 10Parsoid: parsoid-rt.service keeps failing on ruthenium causing alerts in icinga - https://phabricator.wikimedia.org/T209781 (10Joe)
[10:12:17] 10Operations, 10Parsoid: parsoid-rt repeated failures on ruthenium (parsoid::testing) - https://phabricator.wikimedia.org/T209758 (10Joe)
[10:20:01] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:28:19] 10Operations, 10Performance-Team: Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (10Joe) >>! In T206341#4753952, @Imarlier wrote: > @Joe Might be interesting to look at specific calls that appear to perform less well, to see if we can identify specific...
[10:31:51] <_joe_> .
[10:41:31] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[10:54:17] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) @Nemo_bis And here is the super fast HiNet connection in the city, which I am even more sure always succeeds uploading. {F27251642}
[11:01:16] _joe_: I looked for 'ruthenium' in phab and didn't see anything, so I opened one, that's it :)
[11:01:19] thanks for the merge
[11:55:07] PROBLEM - puppet last run on elastic2021 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[12:00:00] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10jcrespo) > i think we should normally not use this method (disable notifications) and instead "schedule downtime" was the better option in the...
[12:28:51] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:41:15] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[12:56:05] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 3106 mails in exim queue.
[14:23:36] 10Operations, 10Traffic: INMARSAT geolocates to the UK, leading to requests going to esams - https://phabricator.wikimedia.org/T209785 (10Reedy)
[14:31:55] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:40:59] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[15:17:11] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:40:55] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[16:00:09] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:06:41] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:11:27] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[16:26:53] RECOVERY - exim queue on mx1001 is OK: OK: Less than 1000 mails in exim queue.
[16:28:39] PROBLEM - Host cp1078 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:17] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:34:41] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:34:47] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:34:55] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:34:55] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:34:55] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:34:55] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:34:57] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:03] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:03] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:11] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:11] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:11] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:11] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:13] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:13] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:19] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:19] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:23] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:23] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:23] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:25] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:29] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:37] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:41] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:41] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:47] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:47] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:47] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:47] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:51] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:51] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:36:01] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:36:03] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:36:03] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:36:07] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:36:07] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:59:03] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:11:31] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[17:15:30] Anybody ^
[17:21:53] I don't know what that is but I'll see what i can see...
[17:22:32] something up with cp1078?
[17:23:13] I think this has come up before, should probably be something on wikitech about it
[17:25:16] yeah, cp1078 is completely down
[17:25:46] <icinga-wm> PROBLEM - Host cp1078 is DOWN: PING CRITICAL - Packet loss = 100%
[17:25:50] missed that one
[17:26:19] !log restarting cp1078 from mgmt console
[17:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:56] at least it's not IPsec/Strongswan itself :)
[17:30:17] RECOVERY - Host cp1078 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[17:30:17] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK
[17:30:17] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK
[17:30:19] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK
[17:30:19] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 40 ESP OK
[17:30:19] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 40 ESP OK
[17:30:25] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 40 ESP OK
[17:30:25] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 40 ESP OK
[17:30:29] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 40 ESP OK
[17:30:29] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 40 ESP OK
[17:30:35] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK
[17:30:41] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK
[17:30:41] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 64 ESP OK
[17:30:43] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 40 ESP OK
[17:30:43] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 40 ESP OK
[17:30:45] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 40 ESP OK
[17:30:45] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 40 ESP OK
[17:30:45] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 40 ESP OK
[17:30:51] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK
[17:30:51] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK
[17:30:51] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 40 ESP OK
[17:30:51] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 40 ESP OK
[17:30:53] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 40 ESP OK
[17:30:55] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 40 ESP OK
[17:30:55] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 40 ESP OK
[17:30:57] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK
[17:30:57] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 40 ESP OK
[17:30:57] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 40 ESP OK
[17:30:59] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK
[17:31:03] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 40 ESP OK
[17:31:03] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 40 ESP OK
[17:31:05] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 64 ESP OK
[17:31:15] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 40 ESP OK
[17:31:15] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 40 ESP OK
[17:31:15] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 40 ESP OK
[17:31:15] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK
[17:31:21] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 40 ESP OK
[17:32:57] 10Operations: cp1078 crash - https://phabricator.wikimedia.org/T209791 (10Andrew)
[17:36:56] 10Operations: cp1078 crash - https://phabricator.wikimedia.org/T209791 (10Andrew) The syslog suggests that the box wasn't actually all the way down before I restarted it, just in distress. Here are its last few minutes: ` Nov 18 17:25:52 cp1078 confd[1060]: 2018-11-18T17:25:52Z cp1078 /usr/bin/confd[1060]: ER...
[17:39:48] 10Operations: cp1078 crash - https://phabricator.wikimedia.org/T209791 (10Andrew) ` Nov 18 16:27:31 cp1078 kernel: [1997356.199019] ------------[ cut here ]------------ Nov 18 16:27:31 cp1078 kernel: [1997356.199032] WARNING: CPU: 5 PID: 0 at /build/linux-IWeKxA/linux-4.9.110/net/sched/sch_generic.c:316 dev_watc...
[17:44:10] ok, this seems resolved for now, I'll leave parsing the kernel trace to those who enjoy such things :)
[17:51:50] <_joe_> just to be clear: one cache host going down can be withstood by our systems pretty well
[17:52:04] <_joe_> it's just a ton of alerts if it's an eqiad host
[19:46:12] that indeed was a ton of alerts
[19:55:08] addshore, yeah a lot of hosts try to set up IPsec connections with a lot of other hosts
[19:55:23] if one breaks then you get a lot of alerts
[20:01:08] Gotcha
[20:13:04] 10Operations, 10ops-eqiad, 10Traffic: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10BBlack)
[20:13:06] 10Operations: cp1078 crash - https://phabricator.wikimedia.org/T209791 (10BBlack)
[20:13:28] 10Operations, 10ops-eqiad, 10Traffic: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10BBlack) Yet another! cp1078 crash ticket above merged into here.
[20:14:38] luckily, we configure the cp (cacheproxy) clusters in such a way that loss of one or more servers is rarely a critical thing!
[20:15:33] ob https://xkcd.com/1737/
[20:22:11] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:42:03] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[21:01:09] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
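As the exchange above explains, many cache hosts hold IPsec tunnels to many other cache hosts, so a single eqiad cache host going down makes every remote peer's Strongswan check go CRITICAL at once, which is why one dead cp1078 produced the 16:34-16:36 alert burst. The sketch below only tallies that fan-out from the alerts in this log (12 codfw cp2xxx, 12 esams cp3xxx, 6 ulsfo cp4xxx, 6 eqsin cp5xxx peers alerted); the real tunnel topology is defined elsewhere and is assumed, not derived, here.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope sketch of the IPsec alert fan-out discussed above.

Peer counts are tallied from the 16:34-16:36 CRITICALs in this log; the
actual tunnel topology lives in configuration management, not here.
"""
alerting_peers = {"codfw": 12, "esams": 12, "ulsfo": 6, "eqsin": 6}

# Each remote cache host runs one "IPsec" Icinga check covering all of its
# tunnels, so a single unreachable eqiad host (cp1078) yields one CRITICAL
# per remote peer that expects a tunnel to it:
total_alerts = sum(alerting_peers.values())
print(f"one eqiad cache host down -> {total_alerts} IPsec CRITICALs")

# Each peer's check output also shows how many of its expected Security
# Associations are still up. In the log, codfw hosts recover with
# "64 ESP OK" and the other sites with "40 ESP OK", so with the two SAs to
# cp1078 (IPv4 + IPv6) stuck connecting they alert with ok: 62 and ok: 38:
expected_sas = {"codfw": 64, "esams": 40, "ulsfo": 40, "eqsin": 40}
for site, expected in expected_sas.items():
    print(f"{site} peer view: ok: {expected - 2} connecting: cp1078_v4, cp1078_v6")
```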
[21:12:15] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[21:52:25] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27979 MB (5% inode=99%)
[21:53:31] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[22:33:01] 10Operations, 10Wikimedia-Mailing-lists: Need to shut down a list - https://phabricator.wikimedia.org/T209726 (10Beeblebrox) my apologies for not including the name, which as it turns out was not easy to find, but it's this one https://lists.wikimedia.org/mailman/listinfo/mediation-en-l
[22:33:40] 10Operations, 10Wikimedia-Mailing-lists: Need to shut down a list, mediation-en-l - https://phabricator.wikimedia.org/T209726 (10Beeblebrox)
[22:35:17] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:35:24] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:35:27] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:35:31] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:35:51] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:36:15] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:38:39] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:39:37] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up
[22:39:47] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient
[22:39:53] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[22:39:57] RECOVERY - Disk space on notebook1004 is OK: DISK OK
[22:40:01] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational
[22:40:21] RECOVERY - DPKG on notebook1004 is OK: All packages OK
[22:43:47] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:01:15] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:02:11] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:02:17] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:02:19] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:02:45] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:03:05] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:06:27] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:09:45] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up
[23:09:57] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient
[23:10:03] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[23:10:05] RECOVERY - Disk space on notebook1004 is OK: DISK OK
[23:10:09] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational
[23:10:31] RECOVERY - DPKG on notebook1004 is OK: All packages OK
[23:11:35] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures