[01:30:05] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:05:25] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/media/{title}{/revision}{/tid} (retrieve media items of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Al
[02:05:25] obile-sections-lead) timed out before a response was received
[02:06:16] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[03:29:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 874.04 seconds
[03:32:06] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[cdh::hadoop::directory /user/spark/applicationHistory]
[03:34:35] PROBLEM - puppet last run on cp4028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:35:15] PROBLEM - puppet last run on analytics1067 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:35:15] PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:55:35] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:57:05] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:57:06] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:57:16] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:57:35] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:57:45] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[03:59:15] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 227.85 seconds
[04:00:15] RECOVERY - puppet last run on analytics1067 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[04:04:35] RECOVERY - puppet last run on cp4028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:05:15] RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:20:45] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds
[04:25:15] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[04:25:16] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[04:25:26] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[04:25:45] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[04:25:46] RECOVERY - DPKG on stat1005 is OK: All packages OK
[04:27:06] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:27:45] RECOVERY - High CPU load on API appserver on mw1345 is OK: OK - load average: 9.45, 21.31, 34.58
[04:50:45] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Sun 2018-02-18 04:50:41 UTC.
[05:18:35] PROBLEM - Apache HTTP on mw2133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:19:25] RECOVERY - Apache HTTP on mw2133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.122 second response time
[05:44:45] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.20, 32.16, 32.06
[05:44:58] (Draft2) Tulsi Bhagat: Deploy Draft namespace on hiwikiversity. [mediawiki-config] - https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535)
[05:48:45] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 33.35, 32.05, 32.01
[05:55:46] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 33.34, 32.76, 32.22
[06:01:55] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.96, 32.62, 32.27
[07:24:32] (CR) Jayprakash12345: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535) (owner: Tulsi Bhagat)
[07:29:25] PROBLEM - Nginx local proxy to apache on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:30:15] RECOVERY - Nginx local proxy to apache on mw2134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.199 second response time
[07:32:48] (CR) Jayprakash12345: [C: 1] Deploy Draft namespace on hiwikiversity. [mediawiki-config] - https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535) (owner: Tulsi Bhagat)
[08:38:16] PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 100%
[08:38:55] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:38:57] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:05] PROBLEM - Host install1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:05] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:06] PROBLEM - Host logstash1007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:06] PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:15] PROBLEM - Host dubnium is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:15] PROBLEM - Host rutherfordium is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:16] PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:16] PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:26] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:39:56] PROBLEM - SSH on ganeti1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:40:52] that doesn't look good
[08:41:51] ganeti1006 died...
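The cascade of host-down alerts above is consistent with a Ganeti hardware node (ganeti1006) failing and taking down the virtual machines it hosts (chlorine, bohrium, mwdebug1002, install1002, webperf1001, logstash1007, netmon1003, dubnium, rutherfordium, planet1001, hassium, releases1001). As a rough sketch of how one might confirm which instances live on the failed node, assuming standard Ganeti command-line tooling run from the cluster master (only the host names come from the log; the commands are illustrative, not what was actually run here):

    # List instances with their primary node, filtering for the failed one
    # (gnt-instance list takes -o/--output with fields such as name, pnode, status).
    sudo gnt-instance list -o name,pnode,status | grep ganeti1006

    # Inspect the node itself once it responds again.
    sudo gnt-node info ganeti1006

In this case the node recovered on its own a few minutes later, as the host-up recoveries below show, so no evacuation was needed.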
[08:42:37] O_o
[08:42:56] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:43:55] RECOVERY - SSH on ganeti1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[08:45:45] RECOVERY - Host chlorine is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[08:45:45] RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[08:45:55] RECOVERY - Host rutherfordium is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[08:45:55] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[08:46:05] RECOVERY - Host dubnium is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[08:46:05] RECOVERY - Host logstash1007 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[08:46:05] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms
[08:46:15] RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[08:46:15] RECOVERY - Host hassium is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms
[08:46:15] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[08:46:45] RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[08:47:25] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[08:47:29] wheeee
[08:47:55] RECOVERY - Host install1002 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[08:57:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[08:59:05] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[09:40:25] PROBLEM - HHVM rendering on mw2127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:41:15] RECOVERY - HHVM rendering on mw2127 is OK: HTTP OK: HTTP/1.1 200 OK - 73246 bytes in 0.391 second response time
[12:46:46] (PS1) 星耀晨曦: Set Topic namespace alias of zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546)
[12:47:48] (PS2) 星耀晨曦: Set Topic namespace alias of zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/412439 (https://phabricator.wikimedia.org/T187546)
[13:04:15] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[wmf-mariadb101-client]
[13:48:05] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 51.11, 49.61, 48.23
[14:34:15] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 50.35, 49.09, 48.04
[14:36:15] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 49.06, 48.77, 48.04
[14:52:15] PROBLEM - High CPU load on API appserver on mw1347 is CRITICAL: CRITICAL - load average: 54.70, 49.70, 48.23
[15:26:25] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 33.30, 32.89, 32.05
[15:41:36] there's other api appservers in warning in icinga, I'm looking at one of these critical here
[15:41:49] <_joe_> yeah
[15:41:57] <_joe_> I guess it's the usual problem
[15:42:01] <_joe_> but let me check as well
[15:42:29] _joe_: namely hhvm locking up ?
[15:42:59] <_joe_> HPHP::jit::detail::enterTC to be specific
[15:43:03] <_joe_> there is a deadlock there
[15:43:47] heheh fun times, anything to collect before restarting?
[15:44:28] <_joe_> not really I guess
[15:44:35] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 40.49, 34.93, 32.66
[15:46:10] so roll-restart via cumin of the appservers involved it is
[15:49:52] <_joe_> !log rolling restart (1 at a time, staggered by 2 minutes) of 18 api appservers in eqiad
[15:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:15] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 52.29, 50.68, 48.20
[16:01:15] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 51.74, 49.06, 48.34
[16:05:05] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 10.77, 12.23, 23.77
[16:06:35] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 10.54, 14.90, 23.70
[16:12:26] PROBLEM - HHVM rendering on mw2144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:13:16] RECOVERY - HHVM rendering on mw2144 is OK: HTTP OK: HTTP/1.1 200 OK - 73031 bytes in 0.259 second response time
[16:24:26] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 37.25, 34.36, 32.17
[16:32:45] RECOVERY - High CPU load on API appserver on mw1347 is OK: OK - load average: 15.77, 19.47, 35.49
[16:34:26] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 34.02, 32.77, 32.06
[16:37:26] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 34.30, 32.56, 32.04
[16:45:25] RECOVERY - High CPU load on API appserver on mw1341 is OK: OK - load average: 12.85, 21.57, 35.54
[16:49:35] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 33.14, 32.29, 32.12
[16:53:35] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 35.02, 32.88, 32.29
[17:59:15] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[18:07:04] Operations, Puppet: Setting packages on 'hold' breaks puppet runs - https://phabricator.wikimedia.org/T187651#3981701 (jcrespo)
[18:28:56] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 31.24, 30.78, 32.02
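The staggered restart _joe_ logged at 15:49:52 maps naturally onto Cumin's batching options: one host per batch with a fixed sleep between batches. A minimal sketch, with a made-up three-host selector standing in for the 18 API appservers actually targeted (the log does not name them) and a plain systemctl restart of HHVM (the production procedure may additionally depool each server before restarting):

    # -b 1: one host per batch; -s 120: sleep 120 seconds between batches.
    # The host list below is hypothetical; substitute the affected appservers.
    sudo cumin -b 1 -s 120 \
        'mw1221.eqiad.wmnet,mw1226.eqiad.wmnet,mw1229.eqiad.wmnet' \
        'systemctl restart hhvm'

The recoveries between 16:05 and 16:45 above are consistent with that staggered restart working its way through the affected hosts.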
[19:21:56] Any sysadmins with production database access around?
[19:24:05] RECOVERY - High CPU load on API appserver on mw1221 is OK: OK - load average: 6.61, 15.08, 23.97
[19:27:18] never mind. toolforge replication databases are lagged. but I got the correct result now.
[19:42:16] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 33.24, 32.54, 32.01
[19:55:25] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 8.81, 14.64, 23.60
[20:18:37] Operations, Puppet, Beta-Cluster-Infrastructure: Setup cron for foreachwikiindblist all-labs.dblist extensions/AbuseFilter/maintenance/purgeOldLogIPData.php on Beta - https://phabricator.wikimedia.org/T187658#3981851 (MarcoAurelio)
[21:04:03] Krinkle: you there? :)
[21:09:40] hi tgr
[21:10:15] hi Hauskatze
[21:12:31] tgr: I was wondering if you could access terbium from where you are?
[21:14:25] Hauskatze: SF, and works for me
[21:14:51] tgr: it's a namespaceDupes.php for enwikiversity
[21:15:11] they changed namespaces and apparently they forgot to run it
[21:15:29] T187660
[21:15:29] T187660: en.wikiversity Draft Namespace Inaccessible Pages - https://phabricator.wikimedia.org/T187660
[21:16:41] if they did, the conflicting page_id would still appear and that'll help to identify/delete the conflicting page.
[21:17:28] probably can wait until next workday? haven't used that script before and middle of Sunday is not the best time to do something stupid and break stuff
[21:17:52] tgr: sure, no problem; it's a safe script nonetheless; but it's not urgent, etc.
[21:18:17] ping me tomorrow and I'll do it
[21:18:43] tgr: tomorrow is a US holiday
[21:18:53] so it's not a working day :P
[21:19:11] most ops are in EU; that's good enough
[21:19:23] orly? I thought it was vice-versa
[21:19:26] I won't be offended if you wait until Tuesday though :)
[21:19:56] I can wait all the time needed
[21:20:19] there's a ticket so in any case someone will pick it
[21:20:26] sometime :)
[21:49:22] (Draft1) MarcoAurelio: throttle: add new rule for Wikidata edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/412606 (https://phabricator.wikimedia.org/T187655)
[21:49:29] (PS2) MarcoAurelio: throttle: add new rule for Wikidata edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/412606 (https://phabricator.wikimedia.org/T187655)
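For the enwikiversity namespace clean-up discussed at 21:14-21:20 (T187660), the script in question is MediaWiki's namespaceDupes.php maintenance script, which at WMF is normally invoked through the mwscript wrapper on a maintenance host such as terbium. The run was deliberately deferred to a workday, so nothing was executed here; the following is only a hedged sketch of what such a run could look like, and the script's current --help should be checked before relying on any option:

    # Dry run: report pages whose titles now collide with the new namespace.
    mwscript namespaceDupes.php --wiki=enwikiversity

    # Once the report looks sane, let the script resolve the conflicts.
    mwscript namespaceDupes.php --wiki=enwikiversity --fix

Running without --fix first matches the cautious approach taken in the conversation: it surfaces the conflicting page_ids mentioned at 21:16:41 without changing anything.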