[00:12:19] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[00:14:35] PROBLEM - Disk space on analytics1039 is CRITICAL: DISK CRITICAL - free space: / 1536 MB (2% inode=97%)
[00:29:15] PROBLEM - Disk space on analytics1039 is CRITICAL: DISK CRITICAL - free space: / 467 MB (0% inode=97%)
[00:42:47] PROBLEM - Hadoop NodeManager on analytics1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[00:59:05] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CRITICAL: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: LOST
[00:59:53] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:12:23] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[01:47:29] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:48:27] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.038 second response time
[02:24:03] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Unable to delete certain files due to "inconsistent state within the internal storage backends" - https://phabricator.wikimedia.org/T141704 (10Fastily) When I try to delete [[ https://en.wikipedia.org/wiki/File:Dr._Richard_Pi...
[02:31:39] PROBLEM - Host mw1272 is DOWN: PING CRITICAL - Packet loss = 100%
[02:32:27] RECOVERY - Host mw1272 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[03:36:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 930.96 seconds
[03:37:01] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:46:26] !log Starting equalization translations for Serbian Cyrillic (sr-ec) and Serbian Latin (sr-el). Possible lags on translatewiki.net due to excessive number of operations
[03:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:46:38] (oops.. wrong channel)
[04:02:43] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:06:21] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 186.00 seconds
[04:48:41] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[04:49:15] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[04:49:17] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[04:49:25] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[04:49:29] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[04:49:37] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[04:49:39] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[05:07:19] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[05:09:01] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[05:09:35] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient
[05:09:37] RECOVERY - Disk space on notebook1004 is OK: DISK OK
[05:09:45] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational
[05:09:47] RECOVERY - DPKG on notebook1004 is OK: All packages OK
[05:09:57] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up
[05:10:07] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[05:37:27] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1004 is OK: OK: synced at Sun 2018-11-18 05:37:26 UTC.
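The repeated "connect to address 10.64.36.107 port 5666: Connection refused" alerts on notebook1004 above all come from the same underlying step: before Icinga can run any service check on a host, it has to open a TCP connection to the NRPE agent listening on port 5666, and a refused connection is reported as CRITICAL for every service that depends on that agent. The sketch below is only an illustration of that connection phase, not the real check_nrpe plugin; the host and port are taken from the alerts above.

```python
#!/usr/bin/env python3
"""Minimal sketch of the TCP-connect phase of an NRPE-style check.

Not the real check_nrpe plugin: it only demonstrates why a down or
firewalled agent yields "Connection refused" CRITICALs. Host and port
are taken from the notebook1004 alerts above.
"""
import socket
import sys

NRPE_PORT = 5666          # standard NRPE listening port
HOST = "10.64.36.107"     # notebook1004, from the alerts above


def check_agent(host: str, port: int = NRPE_PORT, timeout: float = 10.0) -> int:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print(f"OK - NRPE agent on {host}:{port} is reachable")
            return 0      # Nagios/Icinga exit code: OK
    except ConnectionRefusedError:
        print(f"connect to address {host} port {port}: Connection refused")
        return 2          # CRITICAL
    except (socket.timeout, OSError) as exc:
        print(f"CHECK_NRPE: Error - Could not connect to {host}: {exc}")
        return 2          # CRITICAL


if __name__ == "__main__":
    sys.exit(check_agent(HOST))
```

Because every service check on the host goes through the same agent, a single unreachable NRPE daemon fans out into one CRITICAL per service, which is why notebook1004 produces a burst of seven or eight alerts each time it flaps.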
[05:55:59] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[05:56:01] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[05:56:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:57:01] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[05:57:01] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[06:08:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:24:09] PROBLEM - Check size of conntrack table on elastic2021 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.33: Connection reset by peer
[06:24:15] PROBLEM - DPKG on elastic2021 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.33: Connection reset by peer
[06:25:59] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:26:03] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:26:13] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:26:27] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:26:57] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:26:58] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:27:19] PROBLEM - Check systemd state on elastic2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:27:29] RECOVERY - Check size of conntrack table on elastic2021 is OK: OK: nf_conntrack is 0 % full
[06:27:29] PROBLEM - MD RAID on elastic2021 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0
[06:27:36] 10Operations, 10ops-codfw: Degraded RAID on elastic2021 - https://phabricator.wikimedia.org/T209779 (10ops-monitoring-bot)
[06:27:37] RECOVERY - DPKG on elastic2021 is OK: All packages OK
[06:28:13] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[06:28:31] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/mark]
[06:29:21] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[06:39:25] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient
[06:39:25] RECOVERY - Disk space on notebook1004 is OK: DISK OK
[06:39:35] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:39:35] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational
[06:39:39] RECOVERY - DPKG on notebook1004 is OK: All packages OK
[06:39:49] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up
[06:40:03] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[06:45:13] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:58:55] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:59:13] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:12:21] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[08:32:41] 10Operations, 10Citoid, 10Patch-For-Review, 10Service-deployment-requests, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10Mvolz) >>! In T201611#4755783, @akosiaris wrote: > This has now been deployed to the kubernetes staging cluster. > > ` > akosiaris@deploy100...
[08:37:13] RECOVERY - Hadoop NodeManager on analytics1039 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[08:37:50] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[08:40:29] RECOVERY - Disk space on analytics1039 is OK: DISK OK
[08:47:50] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) Now in the better connected city, 15 km. away, using https://en.wikipedia.org/wiki/FarEasTone I get {F27249934} No I don't know why...
[09:00:12] !log cleaned up analytics1039 and restarted Yarn
[09:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:55] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
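The "MD RAID" alerts above report the same fields in both states: elastic2021 went CRITICAL with "State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0" (and an ops-monitoring-bot task was filed automatically), while notebook1004 recovers with matching counters. These are the device counters that `mdadm --detail` exposes for a Linux software RAID array. The sketch below shows one plausible way such a summary could be produced; it is an illustration under that assumption, not the plugin Icinga actually runs, and the array path is a hypothetical example.

```python
#!/usr/bin/env python3
"""Illustrative sketch: summarise a Linux software RAID array in the same
shape as the "MD RAID" alerts above (State / Active / Working / Failed /
Spare). Not the plugin Icinga runs; it parses `mdadm --detail`, whose
output contains lines such as "State : clean" and "Active Devices : 6".
The array path /dev/md0 is an example.
"""
import re
import subprocess
import sys


def md_summary(array: str = "/dev/md0") -> tuple[int, str]:
    out = subprocess.run(
        ["mdadm", "--detail", array], capture_output=True, text=True, check=True
    ).stdout
    fields = {}
    for key in ("State", "Active Devices", "Working Devices",
                "Failed Devices", "Spare Devices"):
        m = re.search(rf"^\s*{key}\s*:\s*(.+)$", out, re.MULTILINE)
        fields[key] = m.group(1).strip() if m else "unknown"

    summary = (f"State: {fields['State']}, Active: {fields['Active Devices']}, "
               f"Working: {fields['Working Devices']}, "
               f"Failed: {fields['Failed Devices']}, "
               f"Spare: {fields['Spare Devices']}")
    degraded = ("degraded" in fields["State"]
                or fields["Failed Devices"] not in ("0", "unknown"))
    return (2, f"CRITICAL: {summary}") if degraded else (0, f"OK: {summary}")


if __name__ == "__main__":
    code, message = md_summary()
    print(message)
    sys.exit(code)
```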
[09:34:43] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[09:40:17] 10Operations, 10Parsoid: parsoid-rt.service keeps failing on ruthenium causing alerts in icinga - https://phabricator.wikimedia.org/T209781 (10elukey)
[09:40:44] 10Operations, 10Parsoid: parsoid-rt.service keeps failing on ruthenium causing alerts in icinga - https://phabricator.wikimedia.org/T209781 (10elukey)
[10:10:23] <_joe_> elukey: your ticket is a duplicate
[10:12:13] 10Operations, 10Parsoid: parsoid-rt.service keeps failing on ruthenium causing alerts in icinga - https://phabricator.wikimedia.org/T209781 (10Joe)
[10:12:17] 10Operations, 10Parsoid: parsoid-rt repeated failures on ruthenium (parsoid::testing) - https://phabricator.wikimedia.org/T209758 (10Joe)
[10:20:01] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:28:19] 10Operations, 10Performance-Team: Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (10Joe) >>! In T206341#4753952, @Imarlier wrote: > @Joe Might be interesting to look at specific calls that appear to perform less well, to see if we can identify specific...
[10:31:51] <_joe_> .
[10:41:31] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[10:54:17] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) @Nemo_bis And here is the super fast HiNet connection in the city, which I am even more sure always succeeds uploading. {F27251642}
[11:01:16] _joe_: I looked for 'ruthenium' in phab and didn't see anything, so I opened one, that's it :)
[11:01:19] thanks for the merge
[11:55:07] PROBLEM - puppet last run on elastic2021 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[12:00:00] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Notifications disablement via puppet not working on icinga - https://phabricator.wikimedia.org/T209757 (10jcrespo) > i think we should normally not use this method (disable notifications) and instead "schedule downtime" was the better option in the...
[12:28:51] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:41:15] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[12:56:05] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 3106 mails in exim queue.
[14:23:36] 10Operations, 10Traffic: INMARSAT geolocates to the UK, leading to requests going to esams - https://phabricator.wikimedia.org/T209785 (10Reedy)
[14:31:55] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:40:59] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[15:17:11] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:40:55] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[16:00:09] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:06:41] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:11:27] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[16:26:53] RECOVERY - exim queue on mx1001 is OK: OK: Less than 1000 mails in exim queue.
[16:28:39] PROBLEM - Host cp1078 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:17] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:34:41] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:34:47] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:34:55] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:34:55] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:34:55] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:34:55] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:34:57] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:03] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:03] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:11] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:11] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:11] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:11] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:13] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:13] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:19] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:19] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:23] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:23] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:23] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:25] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:29] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:37] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:41] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:41] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:47] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:47] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:47] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:47] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1078_v4, cp1078_v6
[16:35:51] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:35:51] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:36:01] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:36:03] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:36:03] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:36:07] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:36:07] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp1078_v4, cp1078_v6
[16:59:03] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:11:31] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[17:15:30] Anybody ^
[17:21:53] I don't know what that is but I'll see what i can see...
[17:22:32] something up with cp1078?
[17:23:13] I think this has come up before, should probably be something on wikitech about it
[17:25:16] yeah, cp1078 is completely down
[17:25:46] <icinga-wm> PROBLEM - Host cp1078 is DOWN: PING CRITICAL - Packet loss = 100%
[17:25:50] missed that one
[17:26:19] !log restarting cp1078 from mgmt console
[17:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:56] at least it's not IPsec/Strongswan itself :)
[17:30:17] RECOVERY - Host cp1078 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[17:30:17] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK
[17:30:17] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK
[17:30:19] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK
[17:30:19] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 40 ESP OK
[17:30:19] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 40 ESP OK
[17:30:25] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 40 ESP OK
[17:30:25] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 40 ESP OK
[17:30:29] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 40 ESP OK
[17:30:29] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 40 ESP OK
[17:30:35] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK
[17:30:41] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK
[17:30:41] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 64 ESP OK
[17:30:43] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 40 ESP OK
[17:30:43] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 40 ESP OK
[17:30:45] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 40 ESP OK
[17:30:45] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 40 ESP OK
[17:30:45] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 40 ESP OK
[17:30:51] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK
[17:30:51] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK
[17:30:51] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 40 ESP OK
[17:30:51] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 40 ESP OK
[17:30:53] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 40 ESP OK
[17:30:55] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 40 ESP OK
[17:30:55] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 40 ESP OK
[17:30:57] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK
[17:30:57] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 40 ESP OK
[17:30:57] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 40 ESP OK
[17:30:59] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK
[17:31:03] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 40 ESP OK
[17:31:03] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 40 ESP OK
[17:31:05] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 64 ESP OK
[17:31:15] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 40 ESP OK
[17:31:15] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 40 ESP OK
[17:31:15] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 40 ESP OK
[17:31:15] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK
[17:31:21] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 40 ESP OK
[17:32:57] 10Operations: cp1078 crash - https://phabricator.wikimedia.org/T209791 (10Andrew)
[17:36:56] 10Operations: cp1078 crash - https://phabricator.wikimedia.org/T209791 (10Andrew) The syslog suggests that the box wasn't actually all the way down before I restarted it, just in distress. Here are its last few minutes: ` Nov 18 17:25:52 cp1078 confd[1060]: 2018-11-18T17:25:52Z cp1078 /usr/bin/confd[1060]: ER...
[17:39:48] 10Operations: cp1078 crash - https://phabricator.wikimedia.org/T209791 (10Andrew) ` Nov 18 16:27:31 cp1078 kernel: [1997356.199019] ------------[ cut here ]------------ Nov 18 16:27:31 cp1078 kernel: [1997356.199032] WARNING: CPU: 5 PID: 0 at /build/linux-IWeKxA/linux-4.9.110/net/sched/sch_generic.c:316 dev_watc...
[17:44:10] ok, this seems resolved for now, I'll leave parsing the kernel trace to those who enjoy such things :)
[17:51:50] <_joe_> just to be clear: one cache host going down can be withstood by our systems pretty well
[17:52:04] <_joe_> it's just a ton of alerts if it's an eqiad host
[19:46:12] that indeed was a ton of alerts
[19:55:08] addshore, yeah a lot of hosts try to set up IPsec connections with a lot of other hosts
[19:55:23] if one breaks then you get a lot of alerts
[20:01:08] Gotcha
[20:13:04] 10Operations, 10ops-eqiad, 10Traffic: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10BBlack)
[20:13:06] 10Operations: cp1078 crash - https://phabricator.wikimedia.org/T209791 (10BBlack)
[20:13:28] 10Operations, 10ops-eqiad, 10Traffic: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10BBlack) Yet another! cp1078 crash ticket above merged into here.
[20:14:38] luckily, we configure the cp (cacheproxy) clusters in such a way that loss of one or more servers is rarely a critical thing!
[20:15:33] ob https://xkcd.com/1737/
[20:22:11] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:42:03] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[21:01:09] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
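As the exchange above explains, many cache hosts hold IPsec tunnels to many other cache hosts, so a single eqiad cache host going down makes every remote peer's Strongswan check go CRITICAL at once, which is why one dead cp1078 produced the 16:34-16:36 alert burst. The sketch below only tallies that fan-out from the alerts in this log (12 codfw cp2xxx, 12 esams cp3xxx, 6 ulsfo cp4xxx, 6 eqsin cp5xxx peers alerted); the real tunnel topology is defined elsewhere and is assumed, not derived, here.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope sketch of the IPsec alert fan-out discussed above.

Peer counts are tallied from the 16:34-16:36 CRITICALs in this log; the
actual tunnel topology lives in configuration management, not here.
"""
alerting_peers = {"codfw": 12, "esams": 12, "ulsfo": 6, "eqsin": 6}

# Each remote cache host runs one "IPsec" Icinga check covering all of its
# tunnels, so a single unreachable eqiad host (cp1078) yields one CRITICAL
# per remote peer that expects a tunnel to it:
total_alerts = sum(alerting_peers.values())
print(f"one eqiad cache host down -> {total_alerts} IPsec CRITICALs")

# Each peer's check output also shows how many of its expected Security
# Associations are still up. In the log, codfw hosts recover with
# "64 ESP OK" and the other sites with "40 ESP OK", so with the two SAs to
# cp1078 (IPv4 + IPv6) stuck connecting they alert with ok: 62 and ok: 38:
expected_sas = {"codfw": 64, "esams": 40, "ulsfo": 40, "eqsin": 40}
for site, expected in expected_sas.items():
    print(f"{site} peer view: ok: {expected - 2} connecting: cp1078_v4, cp1078_v6")
```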
[21:12:15] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[21:52:25] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27979 MB (5% inode=99%)
[21:53:31] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[22:33:01] 10Operations, 10Wikimedia-Mailing-lists: Need to shut down a list - https://phabricator.wikimedia.org/T209726 (10Beeblebrox) my apologies for not including the name, which as it turns out was not easy to find, but it's this one https://lists.wikimedia.org/mailman/listinfo/mediation-en-l
[22:33:40] 10Operations, 10Wikimedia-Mailing-lists: Need to shut down a list, mediation-en-l - https://phabricator.wikimedia.org/T209726 (10Beeblebrox)
[22:35:17] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:35:24] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:35:27] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:35:31] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:35:51] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:36:15] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:38:39] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[22:39:37] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up
[22:39:47] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient
[22:39:53] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[22:39:57] RECOVERY - Disk space on notebook1004 is OK: DISK OK
[22:40:01] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational
[22:40:21] RECOVERY - DPKG on notebook1004 is OK: All packages OK
[22:43:47] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:01:15] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:02:11] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:02:17] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:02:19] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:02:45] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:03:05] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:06:27] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
[23:09:45] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up
[23:09:57] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient
[23:10:03] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[23:10:05] RECOVERY - Disk space on notebook1004 is OK: DISK OK
[23:10:09] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational
[23:10:31] RECOVERY - DPKG on notebook1004 is OK: All packages OK
[23:11:35] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures