[00:03:58] <icinga-wm>	 PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed o
[00:03:58] <icinga-wm>	 e was received
[00:04:48] <icinga-wm>	 RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[00:57:19] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2025964
[01:06:59] <icinga-wm>	 PROBLEM - SSH on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:07:18] <icinga-wm>	 PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused
[01:07:49] <icinga-wm>	 RECOVERY - SSH on scb1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[01:08:18] <icinga-wm>	 RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.039 second response time
[01:11:50] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received
[01:13:38] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[01:29:47] <urandom>	 !log Decommissioning Cassandra, restbase1007-a.eqiad.wmnet (T179422)
[01:29:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:55] <stashbot>	 T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
[02:11:08] <icinga-wm>	 PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received
[02:11:58] <icinga-wm>	 RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[02:18:58] <icinga-wm>	 PROBLEM - SSH on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:18:59] <icinga-wm>	 PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:19:49] <icinga-wm>	 RECOVERY - SSH on scb1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[02:19:49] <icinga-wm>	 RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[02:21:18] <icinga-wm>	 PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received
[02:22:08] <icinga-wm>	 RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy
[03:25:48] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 765.71 seconds
[03:33:58] <icinga-wm>	 PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:34:39] <icinga-wm>	 PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:34:48] <icinga-wm>	 PROBLEM - puppet last run on mw2177 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:55:58] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 105.35 seconds
[04:03:49] <icinga-wm>	 RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:04:39] <icinga-wm>	 RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:04:48] <icinga-wm>	 RECOVERY - puppet last run on mw2177 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:14:59] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp4022 is CRITICAL: CRITICAL: expiry mailbox lag is 2110411
[05:15:19] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp4025 is CRITICAL: CRITICAL: expiry mailbox lag is 2052325
[06:26:30] <wikibugs>	 Operations, wikidiff2, Patch-For-Review, User-Addshore, and 2 others: Update and use php-wikidiff2 to 1.5.1 in production - https://phabricator.wikimedia.org/T177891#3753068 (Legoktm)
[06:55:28] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp4025 is OK: OK: expiry mailbox lag is 2
[08:04:09] <icinga-wm>	 PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:28] <icinga-wm>	 PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:29] <icinga-wm>	 PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:49] <icinga-wm>	 PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:58] <icinga-wm>	 PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:00] <icinga-wm>	 PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:00] <icinga-wm>	 PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:00] <icinga-wm>	 PROBLEM - Host etcd1004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:38] <icinga-wm>	 PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:08:08] <icinga-wm>	 RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[08:08:08] <icinga-wm>	 RECOVERY - Host hassium is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[08:08:18] <icinga-wm>	 RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms
[08:08:18] <icinga-wm>	 RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 1.33 ms
[08:08:18] <icinga-wm>	 RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms
[08:08:18] <icinga-wm>	 RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms
[08:08:18] <icinga-wm>	 RECOVERY - Host etcd1004 is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms
[08:08:29] <icinga-wm>	 RECOVERY - SSH on ganeti1005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[08:10:28] <icinga-wm>	 RECOVERY - Host bohrium is UP: PING WARNING - Packet loss = 93%, RTA = 0.50 ms
[08:11:49] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[08:12:48] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational
[08:24:58] <icinga-wm>	 RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:10:55] <wikibugs>	 (PS1) Nemo bis: [Planet Wikimedia] Remove EndPoint from English planet feeds [puppet] - https://gerrit.wikimedia.org/r/390873
[11:17:43] <wikibugs>	 (CR) Paladox: [Planet Wikimedia] Remove EndPoint from English planet feeds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[11:34:07] <wikibugs>	 (CR) Nemo bis: [Planet Wikimedia] Remove EndPoint from English planet feeds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[11:54:53] <wikibugs>	 (CR) Ori.livneh: [C: 1] StartProfiler.php: Add lots of documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[12:03:02] <wikibugs>	 (CR) Paladox: ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[12:19:29] <icinga-wm>	 PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:49:29] <icinga-wm>	 RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[13:18:26] <wikibugs>	 (CR) Greg Sabino Mullane: "Sorry about that, the new URL should be:" [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[13:21:47] <wikibugs>	 (CR) Nemo bis: "Greg, and is there a corresponding RSS/Atom feed? I tried appending "/feed" to no avail." [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[14:45:08] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp4022 is OK: OK: expiry mailbox lag is 1
[15:33:18] <wikibugs>	 (CR) Dzahn: [Planet Wikimedia] Remove EndPoint from English planet feeds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[15:34:51] <wikibugs>	 (CR) Dzahn: "@Greg Sabino Mullane  There seems to be no feed URL or at least my browser can't detect one?" [puppet] - https://gerrit.wikimedia.org/r/390873 (owner: Nemo bis)
[16:08:53] <hoo>	 We have an elevated number of 503s currently :/
[16:18:58] <bblack>	 yeah seems to be mailbox lag in ulsfo cache_upload (again)
[16:19:35] <bblack>	 !log cp4026 - restart backend (mailbox lag)
[16:19:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:38] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0
[17:01:59] <icinga-wm>	 PROBLEM - Disk space on install1002 is CRITICAL: DISK CRITICAL - free space: / 3018 MB (3% inode=98%)
[17:29:28] <wikibugs>	 (PS1) Framawiki: Add images.collection.cooperhewitt.org to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/390881 (https://phabricator.wikimedia.org/T180241)
[17:34:24] <wikibugs>	 (CR) Zoranzoki21: [C: 1] Add images.collection.cooperhewitt.org to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/390881 (https://phabricator.wikimedia.org/T180241) (owner: Framawiki)
[18:16:19] <wikibugs>	 (PS1) BryanDavis: toolserver_legacy: redirect /~nikola/articlesby.php [puppet] - https://gerrit.wikimedia.org/r/390883 (https://phabricator.wikimedia.org/T179766)
[18:54:49] <icinga-wm>	 PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed o
[18:54:49] <icinga-wm>	 e was received
[18:55:48] <icinga-wm>	 RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy
[19:00:49] <icinga-wm>	 PROBLEM - Disk space on install1002 is CRITICAL: DISK CRITICAL - free space: / 3017 MB (3% inode=98%)
[19:39:14] <urandom>	 !log Decommissioning Cassandra, restbase1007-b.eqiad.wmnet (T179422)
[19:39:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:23] <stashbot>	 T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
[20:14:39] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 13857 bytes in 0.073 second response time
[20:15:09] <icinga-wm>	 RECOVERY - WDQS HTTP on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 13857 bytes in 0.073 second response time
[21:18:28] <icinga-wm>	 PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed o
[21:18:28] <icinga-wm>	 e was received
[21:18:48] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-se
[21:18:48] <icinga-wm>	 out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) timed out before a response was received: /{domai
[21:18:48] <icinga-wm>	 d/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[21:19:48] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve the selected anniversaries for January 15) timed out before a response was received: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read artic
[21:19:48] <icinga-wm>	 2016 (with aggregated=true)) timed out before a response was received
[21:20:19] <icinga-wm>	 RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[21:20:48] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[21:20:48] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[22:50:28] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-se
[22:50:29] <icinga-wm>	 out before a response was received: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve the selected anniversaries for January 15) timed out before a response was
[22:50:29] <icinga-wm>	 }/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary d
[22:51:28] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[23:01:19] <wikibugs>	 (CR) Krinkle: [C: 2] StartProfiler.php: Add lots of documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[23:02:28] <wikibugs>	 (Merged) jenkins-bot: StartProfiler.php: Add lots of documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[23:02:43] <wikibugs>	 (CR) jenkins-bot: StartProfiler.php: Add lots of documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[23:05:49] <icinga-wm>	 PROBLEM - puppet last run on wezen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:35:49] <icinga-wm>	 RECOVERY - puppet last run on wezen is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:37:38] <icinga-wm>	 PROBLEM - HHVM rendering on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:38:28] <icinga-wm>	 RECOVERY - HHVM rendering on mw2130 is OK: HTTP OK: HTTP/1.1 200 OK - 75645 bytes in 0.306 second response time