[00:03:58] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed o
[00:03:58] e was received
[00:04:48] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[00:57:19] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2025964
[01:06:59] PROBLEM - SSH on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:07:18] PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused
[01:07:49] RECOVERY - SSH on scb1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[01:08:18] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.039 second response time
[01:11:50] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received
[01:13:38] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[01:29:47] !log Decommissioning Cassandra, restbase1007-a.eqiad.wmnet (T179422)
[01:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:55] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
[02:11:08] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received
[02:11:58] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[02:18:58] PROBLEM - SSH on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:18:59] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:19:49] RECOVERY - SSH on scb1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[02:19:49] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[02:21:18] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received
[02:22:08] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy
[03:25:48] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 765.71 seconds
[03:33:58] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:34:39] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:34:48] PROBLEM - puppet last run on mw2177 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:55:58] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 105.35 seconds
[04:03:49] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:04:39] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:04:48] RECOVERY - puppet last run on mw2177 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:14:59] PROBLEM - Check Varnish expiry mailbox lag on cp4022 is CRITICAL: CRITICAL: expiry mailbox lag is 2110411
[05:15:19] PROBLEM - Check Varnish expiry mailbox lag on cp4025 is CRITICAL: CRITICAL: expiry mailbox lag is 2052325
[06:26:30] Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5.1 in production - https://phabricator.wikimedia.org/T177891#3753068 (Legoktm)
[06:55:28] RECOVERY - Check Varnish expiry mailbox lag on cp4025 is OK: OK: expiry mailbox lag is 2
[08:04:09] PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:28] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:29] PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:49] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:58] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:00] PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:00] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:00] PROBLEM - Host etcd1004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:38] PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:08:08] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[08:08:08] RECOVERY - Host hassium is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[08:08:18] RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms
[08:08:18] RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 1.33 ms
[08:08:18] RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms
[08:08:18] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms
[08:08:18] RECOVERY - Host etcd1004 is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms
[08:08:29] RECOVERY - SSH on ganeti1005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[08:10:28] RECOVERY - Host bohrium is UP: PING WARNING - Packet loss = 93%, RTA = 0.50 ms
[08:11:49] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[08:12:48] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational
[08:24:58] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:10:55] (PS1) Nemo bis: [Planet Wikimedia] Remove EndPoint from English planet feeds [puppet] - https://gerrit.wikimedia.org/r/390873
[11:17:43] (CR) Paladox: [Planet Wikimedia] Remove EndPoint from English planet feeds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[11:34:07] (CR) Nemo bis: [Planet Wikimedia] Remove EndPoint from English planet feeds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[11:54:53] (CR) Ori.livneh: [C: 1] StartProfiler.php: Add lots of documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[12:03:02] (CR) Paladox: ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[12:19:29] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:49:29] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[13:18:26] (CR) Greg Sabino Mullane: "Sorry about that, the new URL should be:" [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[13:21:47] (CR) Nemo bis: "Greg, and is there a corresponding RSS/Atom feed? I tried appending "/feed" to no avail." [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[14:45:08] RECOVERY - Check Varnish expiry mailbox lag on cp4022 is OK: OK: expiry mailbox lag is 1
[15:33:18] (CR) Dzahn: [Planet Wikimedia] Remove EndPoint from English planet feeds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[15:34:51] (CR) Dzahn: "@Greg Sabino Mullane There seems to be no feed URL or at least my browser can't detect one?" [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[16:08:53] We have an elevated number of 503s currently :/
[16:18:58] yeah seems to be mailbox lag in ulsfo cache_upload (again)
[16:19:35] !log cp4026 - restart backend (mailbox lag)
[16:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:38] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0
[17:01:59] PROBLEM - Disk space on install1002 is CRITICAL: DISK CRITICAL - free space: / 3018 MB (3% inode=98%)
[17:29:28] (PS1) Framawiki: Add images.collection.cooperhewitt.org to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/390881 (https://phabricator.wikimedia.org/T180241)
[17:34:24] (CR) Zoranzoki21: [C: 1] Add images.collection.cooperhewitt.org to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/390881 (https://phabricator.wikimedia.org/T180241) (owner: Framawiki)
[18:16:19] (PS1) BryanDavis: toolserver_legacy: redirect /~nikola/articlesby.php [puppet] - https://gerrit.wikimedia.org/r/390883 (https://phabricator.wikimedia.org/T179766)
[18:54:49] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed o
[18:54:49] e was received
[18:55:48] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy
[19:00:49] PROBLEM - Disk space on install1002 is CRITICAL: DISK CRITICAL - free space: / 3017 MB (3% inode=98%)
[19:39:14] !log Decommissioning Cassandra, restbase1007-b.eqiad.wmnet (T179422)
[19:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:23] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
[20:14:39] RECOVERY - WDQS SPARQL on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 13857 bytes in 0.073 second response time
[20:15:09] RECOVERY - WDQS HTTP on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 13857 bytes in 0.073 second response time
[21:18:28] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed o
[21:18:28] e was received
[21:18:48] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-se
[21:18:48] out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domai
[21:18:48] d/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[21:19:48] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve the selected anniversaries for January 15) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read artic
[21:19:48] 2016 (with aggregated=true)) timed out before a response was received
[21:20:19] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[21:20:48] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[21:20:48] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[22:50:28] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-se
[22:50:29] out before a response was received: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve the selected anniversaries for January 15) timed out before a response was
[22:50:29] }/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary d
[22:51:28] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[23:01:19] (CR) Krinkle: [C: 2] StartProfiler.php: Add lots of documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[23:02:28] (Merged) jenkins-bot: StartProfiler.php: Add lots of documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[23:02:43] (CR) jenkins-bot: StartProfiler.php: Add lots of documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[23:05:49] PROBLEM - puppet last run on wezen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:35:49] RECOVERY - puppet last run on wezen is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:37:38] PROBLEM - HHVM rendering on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:38:28] RECOVERY - HHVM rendering on mw2130 is OK: HTTP OK: HTTP/1.1 200 OK - 75645 bytes in 0.306 second response time