[03:25:38] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 786.06 seconds
[03:56:38] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 192.20 seconds
[05:01:44] (PS1) Gergő Tisza: Enable loginOnly mode for local auth provider on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/409638 (https://phabricator.wikimedia.org/T57420)
[05:57:09] PROBLEM - HP RAID on ms-be1018 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging
[05:57:13] ACKNOWLEDGEMENT - HP RAID on ms-be1018 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T186988
[05:57:17] Operations, ops-eqiad: Degraded RAID on ms-be1018 - https://phabricator.wikimedia.org/T186988#3961204 (ops-monitoring-bot)
[07:12:58] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[07:13:19] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[07:49:29] (PS1) Gergő Tisza: Re-enable cron job for purging ReadingLists data [puppet] - https://gerrit.wikimedia.org/r/409645
[07:49:58] (PS2) Gergő Tisza: Re-enable cron job for purging ReadingLists data [puppet] - https://gerrit.wikimedia.org/r/409645 (https://phabricator.wikimedia.org/T181107)
[07:50:11] (PS3) Gergő Tisza: Re-enable cron job for purging ReadingLists data [puppet] - https://gerrit.wikimedia.org/r/409645 (https://phabricator.wikimedia.org/T181107)
[07:57:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[07:57:59] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[08:00:59] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[08:01:19] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[08:11:09] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[08:11:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[08:14:18] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[08:14:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[08:21:18] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[08:21:29] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[08:30:28] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[08:30:38] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[08:33:28] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[08:33:38] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[08:42:17] (CR) Jayprakash12345: [C: 1] "please do it as soon as possible, :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/408045 (https://phabricator.wikimedia.org/T185347) (owner: Urbanecm)
[09:59:28] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:00:19] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[14:04:19] PROBLEM - Disk space on mx2001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied
[14:06:43] !log installing exim4 security updates on MXs
[14:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:19] RECOVERY - Disk space on mx2001 is OK: DISK OK
[14:56:35] (CR) Anomie: [C: 1] "Seems sane. Haven't tested." [mediawiki-config] - https://gerrit.wikimedia.org/r/409638 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[15:56:18] PROBLEM - puppet last run on analytics1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:21:18] RECOVERY - puppet last run on analytics1060 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[16:51:59] PROBLEM - HHVM jobrunner on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:18] PROBLEM - Nginx local proxy to apache on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:59:28] PROBLEM - puppet last run on db1077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:24:28] RECOVERY - puppet last run on db1077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:40:41] (CR) 星耀晨曦: "Is there a SWAT deployer to deploy this patch? If this patch is no problem, please deploy it." [mediawiki-config] - https://gerrit.wikimedia.org/r/406487 (https://phabricator.wikimedia.org/T184866) (owner: 星耀晨曦)
[17:45:27] Operations, Wikidata: Badges not displaying on trwiki - https://phabricator.wikimedia.org/T186815#3961602 (Sjoerddebruin)
[18:13:38] PROBLEM - Host dbproxy1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:18:48] RECOVERY - Host dbproxy1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[18:48:36] (PS4) ArielGlenn: split up flow dumps into stubs and content passes [dumps] - https://gerrit.wikimedia.org/r/355077 (https://phabricator.wikimedia.org/T164262)
[18:48:58] (CR) jerkins-bot: [V: -1] split up flow dumps into stubs and content passes [dumps] - https://gerrit.wikimedia.org/r/355077 (https://phabricator.wikimedia.org/T164262) (owner: ArielGlenn)
[20:22:58] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4029 is CRITICAL: connect to address 10.128.0.129 and port 3128: Connection refused
[20:23:59] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4029 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.157 second response time
[20:30:41] (PS1) Gergő Tisza: Increase ReadingLists item limit to 5k [mediawiki-config] - https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296)
[20:35:39] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 38.68, 34.78, 32.23
[21:16:48] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 40.04, 34.31, 32.08
[21:20:48] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 39.51, 34.52, 32.50
[21:32:48] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 40.11, 34.02, 32.69
[21:36:49] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 36.55, 32.90, 32.34
[22:07:50] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 43.79, 35.65, 31.85
[22:51:12] Operations, DBA, Performance-Team, Patch-For-Review, codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523#3961878 (jcrespo) See my latest comments on: T167784#3961866 > The third one is a bigger question regarding active-acti...
[23:22:38] (PS2) Gergő Tisza: Increase ReadingLists item limit to 5k [mediawiki-config] - https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296)
[23:43:59] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 20.28, 22.65, 23.93
[23:51:28] (CR) Jcrespo: "count(*) are not accelerated by indexes, at least not significantly." [mediawiki-config] - https://gerrit.wikimedia.org/r/409712 (https://phabricator.wikimedia.org/T186296) (owner: Gergő Tisza)