[01:30:05] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:05:25] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/media/{title}{/revision}{/tid} (retrieve media items of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Al
[02:05:25] obile-sections-lead) timed out before a response was received
[02:06:16] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[03:29:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 874.04 seconds
[03:32:06] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[cdh::hadoop::directory /user/spark/applicationHistory]
[03:34:35] PROBLEM - puppet last run on cp4028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:35:15] PROBLEM - puppet last run on analytics1067 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:35:15] PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:55:35] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:57:05] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:57:06] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:57:16] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:57:35] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:57:45] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:59:15] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 227.85 seconds
[04:00:15] RECOVERY - puppet last run on analytics1067 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[04:04:35] RECOVERY - puppet last run on cp4028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:05:15] RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:20:45] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds
[04:25:15] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[04:25:16] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[04:25:26] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[04:25:45] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[04:25:46] RECOVERY - DPKG on stat1005 is OK: All packages OK
[04:27:06] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:27:45] RECOVERY - High CPU load on API appserver on mw1345 is OK: OK - load average: 9.45, 21.31, 34.58
[04:50:45] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Sun 2018-02-18 04:50:41 UTC.
[05:18:35] PROBLEM - Apache HTTP on mw2133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:19:25] RECOVERY - Apache HTTP on mw2133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.122 second response time
[05:44:45] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.20, 32.16, 32.06
[05:44:58] (Draft2) Tulsi Bhagat: Deploy Draft namespace on hiwikiversity. [mediawiki-config] - https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535)
[05:48:45] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 33.35, 32.05, 32.01
[05:55:46] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 33.34, 32.76, 32.22
[06:01:55] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.96, 32.62, 32.27
[07:24:32] (CR) Jayprakash12345: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535) (owner: Tulsi Bhagat)
[07:29:25] PROBLEM - Nginx local proxy to apache on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:30:15] RECOVERY - Nginx local proxy to apache on mw2134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.199 second response time
[07:32:48] (CR) Jayprakash12345: [C: 1] Deploy Draft namespace on hiwikiversity. [mediawiki-config] - https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535) (owner: Tulsi Bhagat)
[08:38:16] PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 100%
[08:38:55] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:38:57] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:05] PROBLEM - Host install1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:05] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:06] PROBLEM - Host logstash1007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:06] PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:15] PROBLEM - Host dubnium is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:15] PROBLEM - Host rutherfordium is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:16] PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:16] PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:26] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:56] PROBLEM - SSH on ganeti1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:40:52] that doesn't look good
[08:41:51] ganeti1006 died...
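The cascade of host-down alerts above is consistent with a Ganeti hardware node (ganeti1006) failing and taking down the virtual machines it hosts (chlorine, bohrium, mwdebug1002, install1002, webperf1001, logstash1007, netmon1003, dubnium, rutherfordium, planet1001, hassium, releases1001). As a rough sketch of how one might confirm which instances live on the failed node, assuming standard Ganeti command-line tooling run from the cluster master (only the host names come from the log; the commands are illustrative, not what was actually run here):

    # List instances with their primary node, filtering for the failed one
    # (gnt-instance list takes -o/--output with fields such as name, pnode, status).
    sudo gnt-instance list -o name,pnode,status | grep ganeti1006

    # Inspect the node itself once it responds again.
    sudo gnt-node info ganeti1006

In this case the node recovered on its own a few minutes later, as the host-up recoveries below show, so no evacuation was needed.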
[08:42:37] O_o
[08:42:56] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:43:55] RECOVERY - SSH on ganeti1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[08:45:45] RECOVERY - Host chlorine is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[08:45:45] RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[08:45:55] RECOVERY - Host rutherfordium is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[08:45:55] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[08:46:05] RECOVERY - Host dubnium is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[08:46:05] RECOVERY - Host logstash1007 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[08:46:05] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms
[08:46:15] RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[08:46:15] RECOVERY - Host hassium is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms
[08:46:15] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[08:46:45] RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[08:47:25] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[08:47:29] wheeee
[08:47:55] RECOVERY - Host install1002 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[08:57:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[08:59:05] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[09:40:25] PROBLEM - HHVM rendering on mw2127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:41:15] RECOVERY - HHVM rendering on mw2127 is OK: HTTP OK: HTTP/1.1 200 OK - 73246 bytes in 0.391 second response time
[12:46:46] (PS1) 星耀晨曦: Set Topic namespace alias of zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546)
[12:47:48] (PS2) 星耀晨曦: Set Topic namespace alias of zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546)
[13:04:15] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[wmf-mariadb101-client]
[13:48:05] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 51.11, 49.61, 48.23
[14:34:15] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 50.35, 49.09, 48.04
[14:36:15] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 49.06, 48.77, 48.04
[14:52:15] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 54.70, 49.70, 48.23
[15:26:25] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 33.30, 32.89, 32.05
[15:41:36] there's other api appservers in warning in icinga, I'm looking at one of these critical here
[15:41:49] <_joe_> yeah
[15:41:57] <_joe_> I guess it's the usual problem
[15:42:01] <_joe_> but let me check as well
[15:42:29] _joe_: namely hhvm locking up ?
[15:42:59] <_joe_> HPHP::jit::detail::enterTC to be specific
[15:43:03] <_joe_> there is a deadlock there
[15:43:47] heheh fun times, anything to collect before restarting?
[15:44:28] <_joe_> not really I guess
[15:44:35] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 40.49, 34.93, 32.66
[15:46:10] so roll-restart via cumin of the appservers involved it is
[15:49:52] <_joe_> !log rolling restart (1 at a time, staggered by 2 minutes) of 18 api appservers in eqiad
[15:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:15] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 52.29, 50.68, 48.20
[16:01:15] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 51.74, 49.06, 48.34
[16:05:05] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 10.77, 12.23, 23.77
[16:06:35] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 10.54, 14.90, 23.70
[16:12:26] PROBLEM - HHVM rendering on mw2144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:13:16] RECOVERY - HHVM rendering on mw2144 is OK: HTTP OK: HTTP/1.1 200 OK - 73031 bytes in 0.259 second response time
[16:24:26] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 37.25, 34.36, 32.17
[16:32:45] RECOVERY - High CPU load on API appserver on mw1347 is OK: OK - load average: 15.77, 19.47, 35.49
[16:34:26] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 34.02, 32.77, 32.06
[16:37:26] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 34.30, 32.56, 32.04
[16:45:25] RECOVERY - High CPU load on API appserver on mw1341 is OK: OK - load average: 12.85, 21.57, 35.54
[16:49:35] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 33.14, 32.29, 32.12
[16:53:35] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 35.02, 32.88, 32.29
[17:59:15] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[18:07:04] Operations, Puppet: Setting packages on 'hold' breaks puppet runs - https://phabricator.wikimedia.org/T187651#3981701 (jcrespo)
[18:28:56] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 31.24, 30.78, 32.02
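The staggered restart _joe_ logged at 15:49:52 maps naturally onto Cumin's batching options: one host per batch with a fixed sleep between batches. A minimal sketch, with a made-up three-host selector standing in for the 18 API appservers actually targeted (the log does not name them) and a plain systemctl restart of HHVM (the production procedure may additionally depool each server before restarting):

    # -b 1: one host per batch; -s 120: sleep 120 seconds between batches.
    # The host list below is hypothetical; substitute the affected appservers.
    sudo cumin -b 1 -s 120 \
        'mw1221.eqiad.wmnet,mw1226.eqiad.wmnet,mw1229.eqiad.wmnet' \
        'systemctl restart hhvm'

The recoveries between 16:05 and 16:45 above are consistent with that staggered restart working its way through the affected hosts.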
[19:21:56] Any sysadmins with production database access around?
[19:24:05] RECOVERY - High CPU load on API appserver on mw1221 is OK: OK - load average: 6.61, 15.08, 23.97
[19:27:18] never mind. toolforge replication databases are lagged. but I got the correct result now.
[19:42:16] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 33.24, 32.54, 32.01
[19:55:25] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 8.81, 14.64, 23.60
[20:18:37] Operations, Puppet, Beta-Cluster-Infrastructure: Setup cron for foreachwikiindblist all-labs.dblist extensions/AbuseFilter/maintenance/purgeOldLogIPData.php on Beta - https://phabricator.wikimedia.org/T187658#3981851 (MarcoAurelio)
[21:04:03] Krinkle: you there? :)
[21:09:40] hi tgr
[21:10:15] hi Hauskatze
[21:12:31] tgr: I was wondering if you could access terbium from where you are?
[21:14:25] Hauskatze: SF, and works for me
[21:14:51] tgr: it's a namespaceDupes.php for enwikiversity
[21:15:11] they changed namespaces and apparently they forgot to run it
[21:15:29] T187660
[21:15:29] T187660: en.wikiversity Draft Namespace Inaccessible Pages - https://phabricator.wikimedia.org/T187660
[21:16:41] if they did, the conflicting page_id would still appear and that'll help to identify/delete the conflicting page.
[21:17:28] probably can wait until next workday? haven't used that script before and middle of Sunday is not the best time to do something stupid and break stuff
[21:17:52] tgr: sure, no problem; it's a safe script nonetheless; but it's not urgent, etc.
[21:18:17] ping me tomorrow and I'll do it
[21:18:43] tgr: tomorrow is a US holiday
[21:18:53] so it's not a working day :P
[21:19:11] most ops are in EU; that's good enough
[21:19:23] orly? I thought it was vice-versa
[21:19:26] I won't be offended if you wait until Tuesday though :)
[21:19:56] I can wait all the time needed
[21:20:19] there's a ticket so in any case someone will pick it
[21:20:26] sometime :)
[21:49:22] (Draft1) MarcoAurelio: throttle: add new rule for Wikidata edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/412606 (https://phabricator.wikimedia.org/T187655)
[21:49:29] (PS2) MarcoAurelio: throttle: add new rule for Wikidata edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/412606 (https://phabricator.wikimedia.org/T187655)
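For the enwikiversity namespace clean-up discussed at 21:14-21:20 (T187660), the script in question is MediaWiki's namespaceDupes.php maintenance script, which at WMF is normally invoked through the mwscript wrapper on a maintenance host such as terbium. The run was deliberately deferred to a workday, so nothing was executed here; the following is only a hedged sketch of what such a run could look like, and the script's current --help should be checked before relying on any option:

    # Dry run: report pages whose titles now collide with the new namespace.
    mwscript namespaceDupes.php --wiki=enwikiversity

    # Once the report looks sane, let the script resolve the conflicts.
    mwscript namespaceDupes.php --wiki=enwikiversity --fix

Running without --fix first matches the cautious approach taken in the conversation: it surfaces the conflicting page_ids mentioned at 21:16:41 without changing anything.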