[00:01:21] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [00:01:21] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [00:05:41] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [00:06:41] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [00:07:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [00:08:31] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [00:10:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [00:10:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [00:15:10] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [00:16:10] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [00:16:40] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [00:19:51] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [00:25:20] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [00:25:20] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [00:28:21] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 23 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [00:29:20] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [00:30:30] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [00:30:50] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [00:30:50] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [00:33:41] PROBLEM - Restbase LVS codfw on 
restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [00:34:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [00:34:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [00:34:50] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [00:35:41] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 48 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [00:37:20] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) timed out before a response was received [00:38:20] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [00:40:20] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [00:42:10] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [00:42:24] (03PS1) 10Krinkle: mediawiki: Remove unused 'role::logging::mediawiki::errors' [puppet] - 10https://gerrit.wikimedia.org/r/467103 [00:42:50] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [00:42:51] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [00:44:11] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [00:45:41] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [00:46:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [00:46:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [00:48:56] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/CentralAuth/includes/specials/SpecialGlobalGroupMembership.php: T203767 - If2bfa092b (duration: 00m 50s) [00:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:00] T203767: Notice: Uninitialized string offset: 0 in /srv/mediawiki/php-1.32.0-wmf.20/extensions/CentralAuth/includes/specials/SpecialGlobalGroupMembership.php on line 102 - https://phabricator.wikimedia.org/T203767 [00:49:10] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [00:50:10] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is 
OK: All endpoints are healthy [00:50:30] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [00:53:21] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 266, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:53:40] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [00:54:50] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [00:57:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [00:59:50] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [01:00:20] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:00:20] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:00:51] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [01:02:11] PROBLEM - HHVM rendering on mw1314 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [01:03:20] RECOVERY - HHVM rendering on mw1314 is OK: HTTP OK: HTTP/1.1 200 OK - 75635 bytes in 0.138 second response time [01:03:40] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [01:03:40] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [01:04:20] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:06:31] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [01:07:00] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [01:07:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:12:21] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [01:12:21] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [01:16:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} 
(Retrieve all events for Jan 15) timed out before a response was received [01:18:12] (03PS1) 10MGChecker: Reduce Codesniffer exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467104 [01:18:56] (03CR) 10jerkins-bot: [V: 04-1] Reduce Codesniffer exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467104 (owner: 10MGChecker) [01:22:51] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [01:24:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:24:30] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [01:27:20] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [01:31:31] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 24 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:33:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [01:33:00] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [01:36:20] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:37:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [01:38:30] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [01:38:31] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [01:38:50] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 63 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:39:31] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [01:40:31] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) timed out before a response was received [01:41:31] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [01:43:41] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [01:44:41] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [01:45:11] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events 
for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:45:11] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [01:47:21] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [01:48:30] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [01:50:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:52:11] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received [01:52:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [01:53:11] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [01:56:21] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:56:21] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:59:31] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [02:00:08] (03PS2) 10MGChecker: Reduce Codesniffer exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467104 [02:00:20] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [02:01:07] (03CR) 10jerkins-bot: [V: 04-1] Reduce Codesniffer exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467104 (owner: 10MGChecker) [02:02:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [02:06:50] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [02:08:09] 10Operations, 10monitoring, 10Performance-Team (Radar): "Workers" data from prometheus for mw app servers alternates strangely - https://phabricator.wikimedia.org/T206939 (10Krinkle) [02:08:21] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [02:08:21] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: 
/en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [02:08:46] (03PS2) 10Paladox: Planet: Redesgn UI [puppet] - 10https://gerrit.wikimedia.org/r/467100 [02:08:48] (03PS3) 10MGChecker: Reduce Codesniffer exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467104 (https://phabricator.wikimedia.org/T45956) [02:09:59] (03CR) 10jerkins-bot: [V: 04-1] Reduce Codesniffer exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467104 (https://phabricator.wikimedia.org/T45956) (owner: 10MGChecker) [02:10:20] (03PS3) 10Paladox: Planet: Redesgn UI [puppet] - 10https://gerrit.wikimedia.org/r/467100 [02:10:40] 10Operations, 10monitoring, 10Performance-Team (Radar): "Workers" data from prometheus for mw app servers alternates strangely - https://phabricator.wikimedia.org/T206939 (10Krinkle) I initially thought it was just a change of color due to the order of the metrics being indeterministic. But, that's not it.... [02:13:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [02:14:20] (03PS4) 10Paladox: Planet: Redesgn UI [puppet] - 10https://gerrit.wikimedia.org/r/467100 [02:14:30] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [02:15:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [02:16:00] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [02:21:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [02:21:30] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [02:22:30] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [02:23:40] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [02:23:40] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [02:24:30] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [02:25:17] (03CR) 10Paladox: "Here is what it looks like:" [puppet] - 10https://gerrit.wikimedia.org/r/467100 (owner: 10Paladox) [02:26:31] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [02:27:00] PROBLEM - restbase endpoints health on 
restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [02:28:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [02:30:11] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [02:32:01] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [02:33:31] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [02:39:23] (03CR) 10Krinkle: [C: 04-1] Planet: Redesgn UI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467100 (owner: 10Paladox) [02:39:40] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [02:40:40] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [02:40:50] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [02:43:10] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [02:43:30] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [02:43:30] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [02:46:50] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [02:46:50] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [02:47:31] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [02:49:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [02:49:00] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [02:50:24] (03PS5) 10Paladox: Planet: Redesgn UI [puppet] - 10https://gerrit.wikimedia.org/r/467100 [02:50:49] (03CR) 10Paladox: Planet: Redesgn UI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467100 (owner: 10Paladox) [02:50:51] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [02:55:20] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response 
was received [02:55:41] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [02:55:41] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [02:56:00] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received [02:56:41] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [02:56:41] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [02:57:00] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [03:00:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [03:01:10] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [03:03:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [03:06:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [03:06:40] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [03:09:51] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [03:09:51] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [03:10:20] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [03:10:41] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [03:15:21] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [03:16:10] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [03:16:31] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was 
received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [03:20:40] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [03:20:50] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [03:20:50] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [03:21:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [03:24:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [03:24:11] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [03:24:50] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 62 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [03:25:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [03:25:11] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [03:25:11] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [03:27:20] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 832.62 seconds [03:29:20] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [03:29:40] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [03:29:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [03:32:50] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [03:32:50] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [03:34:50] PROBLEM - puppet last run on mw2277 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz],File[/usr/share/GeoIP/GeoIPCity.dat.test] [03:35:00] PROBLEM - puppet last run on mw2266 is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test] [03:35:21] PROBLEM - puppet last run on analytics1067 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test] [03:35:51] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [03:36:11] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [03:36:11] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [03:37:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [03:38:00] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [03:38:10] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [03:40:00] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [03:42:50] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [03:46:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [03:46:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [03:47:11] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [03:48:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [04:00:21] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 27 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [04:00:40] RECOVERY - puppet last run on analytics1067 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [04:04:00] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [04:04:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [04:04:40] PROBLEM - Restbase LVS codfw on 
restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [04:05:00] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [04:05:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [04:05:11] RECOVERY - puppet last run on mw2277 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [04:05:21] RECOVERY - puppet last run on mw2266 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [04:06:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [04:07:30] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 38 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [04:09:11] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [04:10:11] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [04:13:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [04:13:40] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [04:15:41] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [04:15:41] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [04:17:31] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [04:19:30] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) timed out before a response was received [04:19:50] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [04:20:30] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [04:20:40] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 240.54 seconds [04:23:30] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [04:24:40] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [04:27:50] RECOVERY - restbase endpoints health 
on restbase-dev1004 is OK: All endpoints are healthy [04:27:51] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [04:31:11] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [04:31:11] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [04:31:51] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [04:33:21] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [04:34:01] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [04:36:50] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [04:42:10] PROBLEM - Restbase root url on restbase1017 is CRITICAL: connect to address 10.64.32.129 and port 7231: Connection refused [04:42:11] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [04:42:11] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [04:49:40] RECOVERY - Restbase root url on restbase1017 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.005 second response time [05:07:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [05:07:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [05:07:51] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200) [05:08:00] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [05:08:21] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [05:08:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [05:09:00] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [05:10:01] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [05:10:20] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [05:10:41] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [05:11:41] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [05:15:10] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) timed out before a response was received [05:15:11] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [05:15:41] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 44 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [05:16:10] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [05:17:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [05:18:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [05:19:00] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [05:19:10] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [05:20:40] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [05:20:40] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [05:21:01] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [05:24:01] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [05:24:01] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed 
content for April 29, 2016) timed out before a response was received [05:24:50] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [05:26:11] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [05:26:11] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [05:27:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [05:29:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [05:30:11] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [05:31:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [05:32:21] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [05:37:30] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) timed out before a response was received [05:38:51] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [05:39:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [05:41:01] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [05:41:21] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [05:41:30] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [05:44:40] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [05:44:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [05:45:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [05:47:30] RECOVERY - Restbase LVS codfw on 
restbase.svc.codfw.wmnet is OK: All endpoints are healthy [05:47:50] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [05:47:50] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [05:51:11] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [05:51:11] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [05:52:20] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [05:55:20] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [05:55:20] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [05:56:20] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [05:56:20] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [05:56:50] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [05:57:06] (03PS1) 10Gergő Tisza: Bring up password change logging to the same standards as login logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467110 [05:59:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [06:00:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [06:00:30] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) timed out before a response was received [06:01:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [06:03:21] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [06:03:40] PROBLEM - 
IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 61 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:05:11] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [06:07:20] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [06:08:41] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 27 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:12:00] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [06:12:50] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [06:15:11] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [06:16:01] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 46 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:18:40] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [06:18:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [06:21:31] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [06:23:24] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Marostegui) eqiad db API hosts return the query in less than a second as they have the correct schema: ``` root@db1104.eqiad.w... [06:24:50] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [06:26:51] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [06:27:11] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [06:28:07] (03CR) 10Gergő Tisza: "Thanks for wrapping this up!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner: 10Gergő Tisza) [06:28:11] PROBLEM - puppet last run on wdqs1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:29:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [06:31:00] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/phaste] [06:31:11] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:32:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [06:32:31] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [06:34:00] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received [06:37:10] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [06:38:21] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 75 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:40:50] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [06:43:00] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [06:45:11] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [06:46:40] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [06:50:00] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [06:50:51] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) timed out before a response was received [06:51:01] RECOVERY - restbase 
endpoints health on restbase-dev1006 is OK: All endpoints are healthy [06:53:01] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [06:54:21] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [06:56:11] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [06:56:20] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:58:41] RECOVERY - puppet last run on wdqs1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:00:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [07:00:00] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [07:03:51] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [07:04:10] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [07:07:01] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [07:07:21] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [07:16:20] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [07:16:20] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [07:17:50] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [07:20:01] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [07:21:41] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [07:21:41] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [07:23:21] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) timed out before a response was received [07:25:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [07:25:00] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 
15) timed out before a response was received [07:26:10] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [07:26:30] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [07:29:30] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [07:29:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [07:35:00] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [07:37:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [07:38:01] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [07:40:20] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [07:40:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [07:41:11] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [07:43:20] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [07:43:40] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [07:43:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [07:48:01] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [07:48:01] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [07:51:20] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [07:51:31] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [07:51:31] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response 
was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [07:53:21] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [07:54:41] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) timed out before a response was received [07:56:50] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [07:59:10] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [08:00:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [08:01:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [08:02:30] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [08:06:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [08:12:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [08:17:20] PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [08:17:30] PROBLEM - HHVM rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [08:17:30] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [08:18:21] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.039 second response time [08:18:31] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 76170 bytes in 0.410 second response time [08:18:31] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.035 second response time [08:23:01] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [08:24:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [08:24:00] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [08:27:10] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [08:30:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was 
received [08:31:30] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [08:34:49] !log powercycle restbase1015 (frozen, no ssh, no metrics, no root console via serial available) [08:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:01] RECOVERY - Host restbase1015 is UP: PING WARNING - Packet loss = 54%, RTA = 336.98 ms [08:38:40] PROBLEM - cassandra-a SSL 10.64.48.138:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:38:40] PROBLEM - cassandra-c SSL 10.64.48.140:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:38:51] PROBLEM - cassandra-b SSL 10.64.48.139:7001 on restbase1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:39:21] PROBLEM - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.138 and port 9042: Connection refused [08:39:21] PROBLEM - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.140 and port 9042: Connection refused [08:40:02] those are bootstrapping, checking --^ [08:40:11] PROBLEM - cassandra-b CQL 10.64.48.139:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.139 and port 9042: Connection refused [08:40:21] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [08:41:21] yep [08:45:58] ok latencies are going down, that seems good [08:46:45] the only thing that I am worried about is a bit of inconsistency since the restbase1015's instances (3) will have stale data, and hinted hand-offs are not usable IIRC [08:47:41] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 63 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [08:48:31] RECOVERY - cassandra-a SSL 10.64.48.138:7001 on restbase1015 is OK: SSL OK - Certificate restbase1015-a valid until 2020-06-24 13:01:11 +0000 (expires in 619 days) [08:48:40] RECOVERY - cassandra-c SSL 10.64.48.140:7001 on restbase1015 is OK: SSL OK - Certificate restbase1015-c valid until 2020-06-24 13:01:13 +0000 (expires in 619 days) [08:49:00] RECOVERY - cassandra-b SSL 10.64.48.139:7001 on restbase1015 is OK: SSL OK - Certificate restbase1015-b valid until 2020-06-24 13:01:12 +0000 (expires in 619 days) [08:49:11] RECOVERY - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is OK: TCP OK - 0.000 second response time on 10.64.48.138 port 9042 [08:49:14] goood [08:49:20] RECOVERY - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is OK: TCP OK - 0.004 second response time on 10.64.48.140 port 9042 [08:50:12] RECOVERY - cassandra-b CQL 10.64.48.139:9042 on restbase1015 is OK: TCP OK - 0.000 second response time on 10.64.48.139 port 9042 [08:54:45] !log restart Yarn resource manager on an-master1002 to force an-master1001 to take the leadership back - T206943 [08:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:49] T206943: JVM pauses cause Yarn master to failover - https://phabricator.wikimedia.org/T206943 [08:54:56] lovely sunday [09:04:50] 10Operations, 10Elasticsearch, 10Icinga, 10Discovery-Search (Current work), 10Patch-For-Review: reconfigure Icinga alert for 
elasticsearch_shard_size to reduce false positive alerts - https://phabricator.wikimedia.org/T206187 (10Mathew.onipe) After watching the trend of this check for about a week now, I... [09:23:10] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [09:37:30] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 57 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [09:52:31] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [10:14:20] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on einsteinium is OK: (C)130 ge (W)110 ge 105.2 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [10:17:20] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 46 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [10:22:20] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [10:29:30] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 69 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [10:44:45] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) (owner: 10Zoranzoki21) [10:54:50] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [11:02:10] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [12:13:40] (03PS6) 10Zoranzoki21: Edited syntax of the code where is the content for user rights for mlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464485 [12:22:34] (03PS4) 10MGChecker: Reduce Codesniffer exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467104 [12:23:33] (03CR) 10jerkins-bot: [V: 04-1] Reduce Codesniffer exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467104 (owner: 10MGChecker) [13:19:01] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 30 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:26:10] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 48 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:34:42] (03PS1) 10MGChecker: Allow creation of TemplateStyles in Module namspace 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/467123 (https://phabricator.wikimedia.org/T200914) [14:01:40] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [14:08:51] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 72 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [14:15:06] PROBLEM - Host text-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:15:12] PROBLEM - Host text-lb.eqsin.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:13] PROBLEM - Host cr1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [14:15:23] PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:28] PROBLEM - Host upload-lb.eqsin.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:31] really? [14:15:34] PROBLEM - Host upload-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:15:34] PROBLEM - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [14:15:34] PROBLEM - Host cr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:58] hi [14:16:03] hello [14:16:07] hey [14:16:09] hey [14:16:24] depooling [14:16:24] hm [14:16:29] k [14:16:33] (03PS1) 10BBlack: depool eqsin, apparent outage [dns] - 10https://gerrit.wikimedia.org/r/467125 [14:16:44] ack [14:16:49] (03CR) 10BBlack: [C: 032] depool eqsin, apparent outage [dns] - 10https://gerrit.wikimedia.org/r/467125 (owner: 10BBlack) [14:17:40] <_joe_> I was convinced it was already depooled [14:17:56] <_joe_> so I came here with relative calm [14:17:57] could be just our ability to monitor, too, but either way that means probably some kind of backhaul problems for eqsin->codfw [14:18:12] note the hosts didn't go down, just all the public reachability [14:18:20] <_joe_> yes [14:18:27] but I donno, maybe cr1-eqsin outage blocked the reports of the hosts down [14:18:27] so the power issues aren't the cause? [14:18:40] the power window was over like 1.5 days ago [14:19:09] no, they just emailed an hour ago [14:19:11] Maybe this? https://phabricator.wikimedia.org/T206861 [14:19:17] saying that "the maintenace will commence in one hour" [14:19:25] can't be a coincidence [14:19:25] bblack: yeah I think that's what happened, eqsin hosts are marked unreachable by icinga not down [14:19:38] oh, yes [14:19:47] eqsin's ranges aren't routable over the internet as far as I can see [14:19:51] so, 2x power windows in one weekend, neither of them that we're really aware of [14:19:57] wtf? [14:20:16] (03PS6) 10Paladox: Planet: Redesgn UI [puppet] - 10https://gerrit.wikimedia.org/r/467100 [14:20:35] both ended up being surprises. 
We had prenotification of the first one, which nobody saw in their email and never made it to any outage calendar [14:20:36] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:20:36] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:20:37] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:20:42] <_joe_> bblack: the first one was our fault, though [14:20:46] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:20:46] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:20:47] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:20:49] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:20:51] <_joe_> not knowing about it, I mean [14:20:56] PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:20:56] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:20:57] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:20:57] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:20:57] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:20:57] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:20:57] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:21:06] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:21:14] well the real 
question is how a loss of redundancy event becomes an outage [14:21:16] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:21:16] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:21:16] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:21:17] PROBLEM - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:21:17] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:21:17] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:21:17] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:21:18] PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:21:26] PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:21:26] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:21:26] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:21:26] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:21:36] PROBLEM - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:21:36] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, cp5005_v6, cp5006_v4, cp5006_v6 [14:21:36] PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:21:36] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp5001_v4, cp5001_v6, cp5002_v4, cp5002_v6, cp5003_v4, cp5003_v6, cp5004_v4, cp5004_v6, cp5005_v4, 
cp5005_v6, cp5006_v4, cp5006_v6 [14:21:37] PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:21:37] PROBLEM - IPsec on cp1075 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:21:37] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp5007_v4, cp5007_v6, cp5008_v4, cp5008_v6, cp5009_v4, cp5009_v6, cp5010_v4, cp5010_v6, cp5011_v4, cp5011_v6, cp5012_v4, cp5012_v6 [14:21:45] paravoid: yes, that too, but regardless [14:21:55] apparently the second outage was also prenotified over a month ago as well [14:22:08] when looking at the old email for the first, I guess nobody noticed the other one in the mix either [14:26:53] shall we update the topic? I would if I had privs [14:27:13] there's not really anything to update [14:27:45] (03PS7) 10Paladox: Planet: Redesgn UI [puppet] - 10https://gerrit.wikimedia.org/r/467100 [14:28:15] eqsin died, which would've temporarily impacted users in asia, but it's already been depooled, and by the time we update topic and they get here to read it, it's already fixed for them. [14:28:25] (assuming their DNS isn't broken, which we can't in general fix for them) [14:28:49] (03CR) 10Paladox: "The latest commit fixes some issues on mobile (also makes more images responsive)." [puppet] - 10https://gerrit.wikimedia.org/r/467100 (owner: 10Paladox) [14:54:37] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [15:02:06] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 45 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:03:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:10:36] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 49 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:25:13] RECOVERY - Host upload-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 238.42 ms [16:25:13] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [16:25:13] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 68 ESP OK [16:25:26] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 68 ESP OK [16:25:26] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 52 ESP OK [16:25:26] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [16:25:26] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 68 ESP OK [16:25:33] RECOVERY - Host text-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 250.03 ms [16:25:34] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK [16:25:34] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 52 ESP OK [16:25:36] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 226.04 ms [16:25:36] RECOVERY - Host lvs5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 235.00 ms 
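Context for the recoveries above: eqsin had already been taken out of user traffic earlier in the day via the one-line DNS change in Gerrit 467125, so the user-facing impact tapered off shortly after the depool as cached DNS answers expired, even though the alerts kept flowing. As a hedged illustration of what such a site-level depool looks like for gdnsd (the authoritative DNS daemon WMF runs), the snippet below uses an admin_state override; the map and resource names are assumptions for illustration only, not the contents of the actual change.

    # illustrative gdnsd admin_state entry -- map/resource names are assumed, not the real change
    # marking the datacenter DOWN steers resolvers to the remaining sites as TTLs expire;
    # removing the line again is the repool
    geoip/generic-map/eqsin => DOWN
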
[16:25:36] RECOVERY - Host bast5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 236.00 ms [16:25:36] RECOVERY - Host dns5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 238.19 ms [16:25:36] RECOVERY - Host mr1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 236.60 ms [16:25:36] RECOVERY - Host cp5010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 245.06 ms [16:25:37] RECOVERY - Host cp5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 239.90 ms [16:25:37] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 248.13 ms [16:25:38] RECOVERY - Host cp5009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 251.74 ms [16:25:38] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 261.36 ms [16:25:39] RECOVERY - Host cp5005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 264.91 ms [16:25:43] RECOVERY - Host text-lb.eqsin.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 235.93 ms [16:25:43] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 68 ESP OK [16:25:46] RECOVERY - Host cr1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 237.51 ms [16:25:46] RECOVERY - IPsec on cp1087 is OK: Strongswan OK - 52 ESP OK [16:25:56] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 52 ESP OK [16:25:56] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 52 ESP OK [16:25:56] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 52 ESP OK [16:25:57] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 52 ESP OK [16:25:58] RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 52 ESP OK [16:25:58] RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 52 ESP OK [16:25:59] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:26:13] RECOVERY - Host upload-lb.eqsin.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 235.92 ms [16:26:13] RECOVERY - IPsec on cp1081 is OK: Strongswan OK - 52 ESP OK [16:26:13] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 68 ESP OK [16:26:13] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK [16:26:14] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [16:26:14] RECOVERY - IPsec on cp1077 is OK: Strongswan OK - 52 ESP OK [16:26:14] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 68 ESP OK [16:26:15] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [16:26:15] RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 52 ESP OK [16:26:16] RECOVERY - IPsec on cp1075 is OK: Strongswan OK - 52 ESP OK [16:26:26] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [16:26:26] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [16:26:26] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 68 ESP OK [16:26:27] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [16:26:27] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [16:26:27] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [16:26:27] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 68 ESP OK [16:26:46] RECOVERY - IPsec on cp1079 is OK: Strongswan OK - 52 ESP OK [16:26:46] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 52 ESP OK [16:26:56] RECOVERY - Host cr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 241.95 ms [16:27:47] PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
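The IPsec storm during the outage window (the Strongswan "not-conn: cp5001_v4 ..." alerts earlier, and the matching recoveries above once the site returned) is a symptom rather than a separate fault: cache hosts in eqiad and codfw keep strongswan tunnels to every cp50xx host in eqsin, so losing the site drops all of those security associations at once, and they re-establish on their own when connectivity comes back. A quick sanity check from one of the alerting cache hosts uses plain strongswan commands, nothing WMF-specific:

    # run as root on e.g. cp1080 (illustrative host choice)
    ipsec status                        # brief listing of IKE/child SAs, established vs. connecting
    ipsec statusall | grep -A2 cp5001   # detail for the SAs toward a single eqsin peer
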
[16:28:06] RECOVERY - Host lvs5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 235.19 ms [16:28:06] RECOVERY - Host cp5004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 241.94 ms [16:28:06] RECOVERY - Host cp5006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 251.90 ms [16:28:06] RECOVERY - Host cp5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 258.85 ms [16:28:06] RECOVERY - Host lvs5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 255.24 ms [16:28:07] PROBLEM - Check whether ferm is active by checking the default input chain on bast5001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:28:07] PROBLEM - puppet last run on cp5004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:28:08] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: Domain www.wikipedia.org was not found by the server [16:28:08] PROBLEM - puppet last run on dns5002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:28:16] PROBLEM - Webrequests Varnishkafka log producer on cp5007 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [16:28:16] PROBLEM - Varnish HTCP daemon on cp5007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 116 (vhtcpd), args vhtcpd [16:28:16] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53: Servers dns5002.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers dns5002.wikimedia.org are marked down but pooled: dns_rec6_53_udp: Servers dns5002.wikimedia.org are marked down but pooled [16:28:17] PROBLEM - NTP peers on dns5002 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [16:28:17] PROBLEM - eventlogging Varnishkafka log producer on cp5007 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf [16:28:18] PROBLEM - statsv Varnishkafka log producer on cp5007 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf [16:28:19] PROBLEM - Check systemd state on bast5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:28:36] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: Domain www.wikipedia.org was not found by the server [16:28:36] PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53: Servers dns5002.wikimedia.org are marked down but pooled [16:28:46] RECOVERY - Host dns5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 244.49 ms [16:28:46] RECOVERY - Host cp5011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 243.64 ms [16:28:46] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:28:47] PROBLEM - Check whether ferm is active by checking the default input chain on dns5001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:28:56] PROBLEM - Check systemd state on dns5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:29:06] PROBLEM - puppet last run on cp5012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:29:07] PROBLEM - puppet last run on dns5001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:29:16] PROBLEM - puppet last run on cp5011 is CRITICAL: CRITICAL: Puppet has 16 failures. Last run 3 minutes ago with 16 failures. Failed resources (up to 3 shown): Exec[ip route add 2620::861:101:10:64:0:130/128 via mtu lock 1450 dev enp5s0f0],Exec[ip route add 2620::861:101:10:64:0:132/128 via mtu lock 1450 dev enp5s0f0],Exec[ip route add 2620::861:102:10:64:16:22/128 via mtu lock 1450 dev enp5s0f0],Exec[ip route add 2620::861:102:1 [16:29:16] mtu lock 1450 dev enp5s0f0] [16:29:27] PROBLEM - puppet last run on cp5005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:30:06] PROBLEM - puppet last run on cp5007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:30:06] PROBLEM - puppet last run on cp5001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:30:06] RECOVERY - Host cp5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 236.87 ms [16:30:06] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 223.53 ms [16:30:06] RECOVERY - Host cp5012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 236.57 ms [16:30:07] RECOVERY - Host cp5008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 250.71 ms [16:30:07] RECOVERY - Host cp5007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 247.77 ms [16:30:26] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is OK: DNS OK: 0.461 seconds response time. www.wikipedia.org returns 208.80.154.224 [16:30:27] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy [16:30:37] PROBLEM - puppet last run on cp5009 is CRITICAL: CRITICAL: Puppet has 16 failures. Last run 4 minutes ago with 16 failures. Failed resources (up to 3 shown): Exec[ip route add 2620::861:101:10:64:0:130/128 via mtu lock 1450 dev enp5s0f0],Exec[ip route add 2620::861:101:10:64:0:132/128 via mtu lock 1450 dev enp5s0f0],Exec[ip route add 2620::861:102:10:64:16:22/128 via mtu lock 1450 dev enp5s0f0],Exec[ip route add 2620::861:102:1 [16:30:37] mtu lock 1450 dev enp5s0f0] [16:30:47] RECOVERY - Recursive DNS on 103.102.166.8 is OK: DNS OK: 0.248 seconds response time. www.wikipedia.org returns 208.80.154.224 [16:30:47] RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy [16:31:36] RECOVERY - NTP peers on dns5002 is OK: NTP OK: Offset -0.00433 secs [16:31:37] PROBLEM - puppet last run on cp5010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:31:37] PROBLEM - puppet last run on cp5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:31:37] PROBLEM - puppet last run on cp5006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:32:07] PROBLEM - Filesystem available is greater than filesystem size on ms-be2041 is CRITICAL: cluster=swift device=/dev/sdk1 fstype=xfs instance=ms-be2041:9100 job=node mountpoint=/srv/swift-storage/sdk1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [16:32:16] PROBLEM - puppet last run on lvs5002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:33:16] PROBLEM - NTP peers on dns5001 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [16:33:17] RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational [16:33:37] RECOVERY - Webrequests Varnishkafka log producer on cp5007 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [16:33:37] RECOVERY - Varnish HTCP daemon on cp5007 is OK: PROCS OK: 1 process with UID = 116 (vhtcpd), args vhtcpd [16:33:46] RECOVERY - eventlogging Varnishkafka log producer on cp5007 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf [16:33:47] RECOVERY - statsv Varnishkafka log producer on cp5007 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf [16:34:03] !log forcing a puppet run on all eqsin hosts with batch 1 to clear most of the alarms - T206861 [16:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:07] RECOVERY - puppet last run on cp5012 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:34:07] T206861: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 [16:34:17] RECOVERY - NTP peers on dns5001 is OK: NTP OK: Offset 0.011816 secs [16:35:02] 10Operations, 10Traffic: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (10faidon) Update: a few hours later, power seemingly got back, so at 2018-10-13 03:07 UTC @bblack repooled eqsin (logged at SAL). Unfortunately, power never got back to cr1-eqsin's PEM 0, asw1-eqsin's PEM 0 and th... [16:35:06] RECOVERY - puppet last run on cp5007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:36:37] RECOVERY - puppet last run on cp5006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:36:37] RECOVERY - puppet last run on cp5003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:38:01] 10Operations, 10Traffic: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (10Volans) Once recovered we found that those hosts had a 5 minutes uptime: ``` dns5001.wikimedia.org lvs5001.eqsin.wmnet bast5001.wikimedia.org cp5011.eqsin.wmnet cp5009.eqsin.wmnet cp5007.eqsin.wmnet ``` Looking... 
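The "forcing a puppet run on all eqsin hosts with batch 1" entry above is worth unpacking: after the hosts lost power and rebooted, the fastest way to clear the pile of per-host alerts is to re-run the Puppet agent everywhere, and a batch size of 1 serializes the runs so a bad catalog cannot make the already-degraded site worse all at once. In cumin terms that is roughly the invocation below; the A:eqsin alias is an assumption, and run-puppet-agent is the wrapper script WMF deploys (plain `puppet agent --test` is the generic equivalent):

    # illustrative cumin invocation; the A:eqsin alias is assumed to exist in the cumin config
    sudo cumin -b 1 'A:eqsin' 'run-puppet-agent'
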
[16:42:17] RECOVERY - puppet last run on lvs5002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:43:57] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:44:26] PROBLEM - IPMI Sensor Status on lvs5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [16:44:37] PROBLEM - IPMI Sensor Status on lvs5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [16:44:46] RECOVERY - puppet last run on cp5005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:46:27] PROBLEM - IPMI Sensor Status on cp5008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [16:46:27] PROBLEM - IPMI Sensor Status on cp5012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [16:47:25] 10Operations, 10Traffic: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (10faidon) Respond from Equinix: > With regards to this Trouble ticket, we went onsite and observed the following, > R0604 A Feed is still on live and all equipment are still powered up > R0603- A Feed in-rack break... [16:48:27] RECOVERY - puppet last run on cp5004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:49:27] RECOVERY - puppet last run on dns5001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:50:17] RECOVERY - puppet last run on cp5001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:50:56] RECOVERY - puppet last run on cp5009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:50:56] PROBLEM - IPMI Sensor Status on cp5005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] [16:52:37] PROBLEM - IPMI Sensor Status on cp5004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [16:52:37] PROBLEM - IPMI Sensor Status on cp5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [16:53:36] RECOVERY - puppet last run on dns5002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [16:54:37] RECOVERY - puppet last run on cp5011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:56:27] PROBLEM - IPMI Sensor Status on cp5010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [16:56:56] RECOVERY - puppet last run on cp5010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:59:36] PROBLEM - IPMI Sensor Status on cp5006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] [17:02:46] PROBLEM - IPMI Sensor Status on cp5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [17:06:56] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 58.54 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:08:16] PROBLEM - IPMI Sensor Status on cp5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS 
Redundancy = Critical, Status = Critical] [17:10:47] PROBLEM - IPMI Sensor Status on dns5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [17:14:32] (03CR) 10Gergő Tisza: Move auth logging to different channels for easier counting (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464077 (https://phabricator.wikimedia.org/T150300) (owner: 10Gergő Tisza) [17:15:36] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 72.07 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:15:56] RECOVERY - Check whether ferm is active by checking the default input chain on bast5001 is OK: OK ferm input default policy is set [17:16:06] RECOVERY - Check systemd state on bast5001 is OK: OK - running: The system is fully operational [17:18:48] (03CR) 10Smalyshev: [C: 04-1] "Not to be merged before we load the dump." [puppet] - 10https://gerrit.wikimedia.org/r/467097 (owner: 10Smalyshev) [17:20:38] 10Operations, 10Traffic: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (10Volans) On `bast5001` ferm failed to start at reboot due to failed DNS resolution query. The next puppet runs didn't restart it. I had to manually start it. The host have been 55 minutes without ferm rules applie... [17:21:57] RECOVERY - Check whether ferm is active by checking the default input chain on dns5001 is OK: OK ferm input default policy is set [17:22:06] RECOVERY - Check systemd state on dns5001 is OK: OK - running: The system is fully operational [17:32:07] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [17:35:17] PROBLEM - IPMI Sensor Status on lvs5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [17:39:26] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 49 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [17:39:46] PROBLEM - IPMI Sensor Status on dns5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [17:40:04] 10Operations, 10Traffic, 10monitoring: Icinga: check_confd_vcl_reload unknown when file is missing - https://phabricator.wikimedia.org/T206950 (10Volans) p:05Triage>03Normal [17:44:07] 10Operations, 10Traffic: Puppet doesn't restart ferm on failure - https://phabricator.wikimedia.org/T206951 (10Volans) p:05Triage>03Normal [17:50:21] 10Operations, 10Traffic: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (10Volans) Current status recap: - Maintenance on one power line is still ongoing, all servers are reported up and running, without icinga alarms but the loss of power redundancy. - JNX_ALARMS WARNING - 0 red alarms... 
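T206951 ("Puppet doesn't restart ferm on failure") captures the nastier finding above: on bast5001 the ferm unit failed once at boot because its rules reference hostnames and the resolver wasn't reachable yet, nothing retried it, and the host ran without firewall rules until it was started by hand. The generic mitigations are to order the unit after name resolution is up (e.g. systemd's nss-lookup.target) so the boot-time race goes away, and to have something retry a unit that is found dead, whether that is Puppet's `ensure => running` on the service or a systemd restart policy. Sketched as an assumption rather than the fix actually adopted for the task, the retry amounts to no more than:

    # what the manual intervention boiled down to; an automated retry
    # (puppet 'ensure => running' on the service, or a systemd restart policy)
    # would do the same on the next pass
    systemctl is-active --quiet ferm || systemctl restart ferm
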
[17:52:47] PROBLEM - IPMI Sensor Status on cp5007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[17:52:47] PROBLEM - IPMI Sensor Status on cp5011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[17:54:21] (03CR) 10Smalyshev: "ok, dumps are in" [puppet] - 10https://gerrit.wikimedia.org/r/467097 (owner: 10Smalyshev)
[17:56:56] PROBLEM - IPMI Sensor Status on bast5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[17:57:16] PROBLEM - IPMI Sensor Status on cp5009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[17:58:35] (03PS6) 10Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845)
[18:04:24] (03PS2) 10Urbanecm: Add shn to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/467080 (https://phabricator.wikimedia.org/T206777)
[18:14:56] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 26 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[18:18:05] (03PS3) 10Urbanecm: Enable Translate on idwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460602 (https://phabricator.wikimedia.org/T204292)
[18:22:06] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 63 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[18:39:37] (03PS1) 10Gergő Tisza: Disable AICaptcha data collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467139 (https://phabricator.wikimedia.org/T186244)
[18:40:25] (03PS2) 10Smalyshev: Enable tracking lexemes in Updater [puppet] - 10https://gerrit.wikimedia.org/r/467097
[19:06:17] RECOVERY - Juniper alarms on cr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[19:07:36] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[19:08:26] RECOVERY - IPMI Sensor Status on cp5003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:10:06] RECOVERY - IPMI Sensor Status on dns5001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:11:06] RECOVERY - IPMI Sensor Status on dns5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:11:37] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[19:12:06] RECOVERY - Host ripe-atlas-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 232.09 ms
[19:12:37] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[19:13:07] RECOVERY - Host ripe-atlas-eqsin is UP: PING OK - Packet loss = 0%, RTA = 243.02 ms
[19:14:37] RECOVERY - IPMI Sensor Status on lvs5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:14:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 54 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[19:14:56] RECOVERY - IPMI Sensor Status on lvs5003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:16:47] RECOVERY - IPMI Sensor Status on cp5012 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:16:47] RECOVERY - IPMI Sensor Status on cp5008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:21:16] RECOVERY - IPMI Sensor Status on cp5005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:22:46] RECOVERY - IPMI Sensor Status on cp5001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:22:46] RECOVERY - IPMI Sensor Status on cp5004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:22:46] RECOVERY - IPMI Sensor Status on cp5010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:22:46] RECOVERY - IPMI Sensor Status on cp5006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:22:47] RECOVERY - IPMI Sensor Status on cp5007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:22:47] RECOVERY - IPMI Sensor Status on cp5011 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:22:47] RECOVERY - IPMI Sensor Status on lvs5001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:22:48] RECOVERY - IPMI Sensor Status on cp5009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:22:48] RECOVERY - IPMI Sensor Status on bast5001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:22:49] RECOVERY - IPMI Sensor Status on cp5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[19:24:12] 10Operations, 10Traffic: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (10Volans) It seems that the power has been restored, all the outstanding alarms have recovered and also the RIPE Atlas is back online.
[19:24:56] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 27 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[19:37:16] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 38 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[19:57:58] 10Operations, 10Traffic: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10faidon)
[19:59:35] 10Operations, 10ops-eqsin, 10Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10faidon) @RobH ping? This has been pending since July, with the last update being Aug 27(!?)
[20:01:10] (03PS1) 10Krinkle: tests: Make phpunit tests pass on PHP 7.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467141
[20:02:40] (03PS2) 10Krinkle: tests: Make phpunit tests pass on PHP 7.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467141 (https://phabricator.wikimedia.org/T176370)
[20:02:49] (03CR) 10Krinkle: [C: 032] tests: Make phpunit tests pass on PHP 7.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467141 (https://phabricator.wikimedia.org/T176370) (owner: 10Krinkle)
[20:03:21] 10Operations, 10Traffic: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10faidon) >>! In T206861#4664738, @Volans wrote: > It seems that the power has been restored, all the outstanding alarms have recovered and also the RIPE Atlas is back online. That's great! Note that the maintenance...
[20:03:55] (03Merged) 10jenkins-bot: tests: Make phpunit tests pass on PHP 7.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467141 (https://phabricator.wikimedia.org/T176370) (owner: 10Krinkle)
[20:07:59] (03CR) 10jenkins-bot: tests: Make phpunit tests pass on PHP 7.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467141 (https://phabricator.wikimedia.org/T176370) (owner: 10Krinkle)
[20:08:17] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 227, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:09:22] what the fuck
[20:10:09] XioNoX: around?
[20:10:27] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:11:00] asw2-a FPC 7 rebooted
[20:12:37] what
[20:14:49] seems to have recovered, with the exception of
[20:14:49] 2018-10-14 20:09:22 UTC Minor FPC 7 PEM 0 is not powered
[20:21:42] can't find an obvious reason in the logs, will look more at it tomorrow
[20:23:13] (03PS1) 10Krinkle: tests: Add tests for wmf-config/*Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467149
[20:23:31] and will follow up with cmjohnson1 tomorrow as well, to get that PEM back online
[20:23:57] (03CR) 10Krinkle: [C: 032] tests: Add tests for wmf-config/*Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467149 (owner: 10Krinkle)
[20:24:42] (03Merged) 10jenkins-bot: tests: Add tests for wmf-config/*Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467149 (owner: 10Krinkle)
[20:25:43] (03CR) 10jenkins-bot: tests: Add tests for wmf-config/*Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467149 (owner: 10Krinkle)
[20:27:46] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:34:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 49 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:42:40] (03PS1) 10Krinkle: import: Fix broken labs-only config in wmfImportSources() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467232
[20:42:42] (03PS1) 10Krinkle: multiversion: Remove reference to deprecated getRealmSpecificFilename() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467233 (https://phabricator.wikimedia.org/T189966)
[20:42:56] (03CR) 10Krinkle: [C: 032] "Beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467232 (owner: 10Krinkle)
[20:44:02] (03Merged) 10jenkins-bot: import: Fix broken labs-only config in wmfImportSources() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467232 (owner: 10Krinkle)
[20:45:21] (03PS2) 10Krinkle: multiversion: Remove reference to deprecated getRealmSpecificFilename() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467233 (https://phabricator.wikimedia.org/T189966)
[20:46:12] (03CR) 10Krinkle: [C: 04-1] "Two files match:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467233 (https://phabricator.wikimedia.org/T189966) (owner: 10Krinkle)
[20:47:32] !log krinkle@deploy1001 Synchronized wmf-config/import.php: beta-only (duration: 00m 54s)
[20:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:59] (03PS3) 10Krinkle: multiversion: Remove reference to deprecated getRealmSpecificFilename() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467233 (https://phabricator.wikimedia.org/T189966)
[21:00:01] (03CR) 10jenkins-bot: import: Fix broken labs-only config in wmfImportSources() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467232 (owner: 10Krinkle)
[21:14:58] 10Operations, 10Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (10Ladsgroup) I did it for wikipedia-fa-tech
[21:33:09] (03PS4) 10Krinkle: multiversion: Remove reference to deprecated getRealmSpecificFilename() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467233 (https://phabricator.wikimedia.org/T189966)
[21:33:50] (03CR) 10Krinkle: [C: 032] multiversion: Remove reference to deprecated getRealmSpecificFilename() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467233 (https://phabricator.wikimedia.org/T189966) (owner: 10Krinkle)
[21:34:52] (03Merged) 10jenkins-bot: multiversion: Remove reference to deprecated getRealmSpecificFilename() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467233 (https://phabricator.wikimedia.org/T189966) (owner: 10Krinkle)
[21:35:46] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:39:37] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1848 bytes in 0.035 second response time
[21:39:46] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1847 bytes in 0.009 second response time
[21:40:06] PROBLEM - Nginx local proxy to apache on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1848 bytes in 0.056 second response time
[21:42:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 55 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:43:51] (03PS1) 10Krinkle: multiversion: Move MWWikiversions include from MWRealm to MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467235
[21:43:59] (03CR) 10Krinkle: [C: 032] multiversion: Move MWWikiversions include from MWRealm to MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467235 (owner: 10Krinkle)
[21:45:02] (03Merged) 10jenkins-bot: multiversion: Move MWWikiversions include from MWRealm to MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467235 (owner: 10Krinkle)
[21:45:27] RECOVERY - Nginx local proxy to apache on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 623 bytes in 0.151 second response time
[21:46:07] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.057 second response time
[21:46:16] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 76265 bytes in 0.116 second response time
[21:48:10] (03CR) 10jenkins-bot: multiversion: Remove reference to deprecated getRealmSpecificFilename() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467233 (https://phabricator.wikimedia.org/T189966) (owner: 10Krinkle)
[21:48:12] (03CR) 10jenkins-bot: multiversion: Move MWWikiversions include from MWRealm to MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467235 (owner: 10Krinkle)
[21:48:48] !log krinkle@deploy1001 Synchronized multiversion/MWMultiVersion.php: I83b2bdd53c13e (duration: 00m 50s)
[21:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[22:18:27] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[22:22:56] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[22:23:36] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 27 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[22:27:06] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[22:28:27] The MW fatal spike is the same as 2 days ago, a user on Commons making semi-automated edits at a high frequency triggering lock contention on a certain user row
[22:30:03] same user?
[22:36:46] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 39 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[22:39:18] Krenair: No, I don't think so.
[22:39:26] last time was a bot-like user name
[22:39:36] This one is a user adding the same categories to a lot of files
[22:40:53] (03PS1) 10Krinkle: services: Simplify structure of ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467237
[22:41:10] Krenair: Mind glancing over ^ to see if I missed anything?
[22:43:09] ok
[22:49:59] Krinkle, I noticed that only one actual host is in use for jobqueue_redis in each DC (in the current version, not just your diff), is that correct?
[22:50:28] I guess so. It's awaiting decom given we use eventbus/kafka now for jq
[22:50:37] it was scaled down
[22:50:40] the old one
[22:50:44] ah
[22:52:48] (03CR) 10Krinkle: [C: 032] services: Simplify structure of ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467237 (owner: 10Krinkle)
[22:52:55] (03PS1) 10Krinkle: Refactor wmf-config/env.php out of multiversion/MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467238
[22:52:57] (03PS1) 10Krinkle: errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467239
[22:53:05] (03CR) 10Alex Monk: [C: 031] services: Simplify structure of ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467237 (owner: 10Krinkle)
[22:53:10] Thanks :)
[22:53:13] Krinkle, wait you're deploying this now?
[22:53:36] Yeah, I want to get this done. It'll never get prioritised otherwise.
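For context on the ProductionServices change under review above: a file of this kind is, roughly, a per-datacenter map from service name to backend host(s), which is why a service listing only a single host per DC (like jobqueue_redis here) stands out during review. A minimal sketch, with placeholder keys and hostnames rather than the real wmf-config contents, might look like:

    <?php
    // Minimal sketch only; service keys and hostnames are placeholders,
    // not the actual production configuration.
    return [
        'eqiad' => [
            // Most entries list every backend host for the service; a single
            // remaining host (as with jobqueue_redis above) usually means the
            // service is being scaled down or replaced.
            'jobqueue_redis' => [ 'rdb1001.example.eqiad.wmnet' ],
            'search' => [ 'search.svc.eqiad.wmnet' ],
        ],
        'codfw' => [
            'jobqueue_redis' => [ 'rdb2001.example.codfw.wmnet' ],
            'search' => [ 'search.svc.codfw.wmnet' ],
        ],
    ];

The later "services: Array format for 'search'" change reads like the same idea applied consistently: a list of hosts for each service, even when only one is configured.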
[22:53:37] Also I just noticed you wrote 'May may' in the comment at the top
[22:53:58] (03PS2) 10Krinkle: services: Simplify structure of ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467237
[22:53:58] thx
[22:54:20] (03CR) 10Krinkle: [C: 032] services: Simplify structure of ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467237 (owner: 10Krinkle)
[22:55:20] (03Merged) 10jenkins-bot: services: Simplify structure of ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467237 (owner: 10Krinkle)
[22:55:51] (03Abandoned) 10Krinkle: [WIP] errorpages: Remove unused hhvm-fatal-error.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412829 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[22:55:57] (03Abandoned) 10Krinkle: mediawiki/hhvm: Move fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[22:56:16] (03PS2) 10Krinkle: errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T113114)
[22:56:30] (03CR) 10jerkins-bot: [V: 04-1] errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[23:04:17] 10Operations, 10Performance-Team, 10Availability: Perform a statsd and Graphite switch - https://phabricator.wikimedia.org/T206963 (10Krinkle)
[23:06:36] (03PS2) 10Krinkle: Refactor wmf-config/env.php out of multiversion/MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467238
[23:06:38] (03PS3) 10Krinkle: errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963)
[23:06:53] (03CR) 10jenkins-bot: services: Simplify structure of ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467237 (owner: 10Krinkle)
[23:07:01] * Krinkle staging mwdebug1001
[23:11:43] OK. I should've done a full diff of the return value.
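A "full diff of the return value", as mentioned above, is a cheap way to confirm that a pure refactor of a config file is a no-op: dump what each version of the file returns and compare the dumps. A minimal sketch, assuming the file in question ends in a plain `return` of an array and using hypothetical before/after copies, might be:

    <?php
    // Minimal sketch; 'config.before.php' and 'config.after.php' are
    // hypothetical copies of the file before and after the refactor,
    // and both are assumed to simply `return [ ... ];`.
    $before = require 'config.before.php';
    $after = require 'config.after.php';

    // var_export() produces a stable, diffable text rendering of the value.
    file_put_contents( '/tmp/before.txt', var_export( $before, true ) . "\n" );
    file_put_contents( '/tmp/after.txt', var_export( $after, true ) . "\n" );

    // A plain text diff (e.g. `diff -u /tmp/before.txt /tmp/after.txt`)
    // should then come back empty for a true no-op refactor.

Note that var_export() only handles plain arrays and scalars cleanly; config that returns closures or objects would need a different comparison.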
[23:14:17] PROBLEM - Device not healthy -SMART- on dbstore1002 is CRITICAL: cluster=mysql device=megaraid,2 instance=dbstore1002:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbstore1002&var-datasource=eqiad%2520prometheus%252Fops
[23:15:24] (03PS1) 10Krinkle: services: Array format for 'search' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467241
[23:15:39] (03CR) 10Krinkle: [C: 032] services: Array format for 'search' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467241 (owner: 10Krinkle)
[23:16:39] (03Merged) 10jenkins-bot: services: Array format for 'search' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467241 (owner: 10Krinkle)
[23:21:38] !log krinkle@deploy1001 Synchronized wmf-config/ProductionServices.php: If4d8faa4 (duration: 00m 48s)
[23:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:04] (03CR) 10jenkins-bot: services: Array format for 'search' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467241 (owner: 10Krinkle)
[23:27:07] PROBLEM - MegaRAID on dbstore1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[23:27:09] ACKNOWLEDGEMENT - MegaRAID on dbstore1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206965
[23:27:14] 10Operations, 10ops-eqiad: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10ops-monitoring-bot)
[23:33:11] (03PS3) 10Krinkle: Refactor wmf-config/env.php out of multiversion/MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467238
[23:33:17] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:33:21] (03CR) 10jerkins-bot: [V: 04-1] Refactor wmf-config/env.php out of multiversion/MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467238 (owner: 10Krinkle)
[23:34:44] (03PS4) 10Krinkle: Refactor wmf-config/env.php out of multiversion/MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467238
[23:39:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:42:47] !log krinkle@deploy1001 Synchronized multiversion/getMWVersion: Ice9a74e73481 no-op (duration: 00m 49s)
[23:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:05] (03CR) 10Krinkle: [C: 032] Refactor wmf-config/env.php out of multiversion/MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467238 (owner: 10Krinkle)
[23:46:43] (03CR) 10Krinkle: [C: 04-1] "To be cherry-picked to Beta and mwdebug1001 first for manual testing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) (owner: 10Krinkle)
[23:47:06] (03Merged) 10jenkins-bot: Refactor wmf-config/env.php out of multiversion/MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467238 (owner: 10Krinkle)
[23:52:16] (03CR) 10jenkins-bot: Refactor wmf-config/env.php out of multiversion/MWRealm.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467238 (owner: 10Krinkle)
[23:57:47] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[23:57:58] * Krinkle canary staging on mwdebug1001
[23:58:01] * Krinkle canary staging on mw1278