In a connected world our software solutions increasingly consist of multiple, distributed components. Service Oriented Architectures are nothing new, and with the growing interest in Microservices it looks like the trend to distribute is going nowhere soon.
When dealing with remote components, there are two (very) broad categories of issues we need to consider:
- firstly, how we respond to failures in these remote systems, and
- secondly, how we manage our calls to these remote systems to keep our own systems performant, and keep latency to a minimum
What is Hystrix?
Hystrix is a library, originally developed by Netflix that lets you deal with issues with latency and fault-tolerance in complex, distributed systems. It has the expectation that failures of all kinds will occur baked into it – and provides a number of tools and solutions for dealing with different common scenarios.
Hystrix is open source and available on Github. In their own words – “Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.”
Demo Application – Betting Service
I’ve created a simple demo application – which is available over on Github. Click the link below:
The example application uses a simple BettingService to illustrate some of the tools available when using Hystrix to access a remote service.
The Betting Service has 3 methods we will use, exposed in the following interface:
BettingService.java
package com.cor.hysterix.service; import java.util.List; import com.cor.hysterix.domain.Horse; import com.cor.hysterix.domain.Racecourse; /** * Simulates the interface for a remote betting service. * */ public interface BettingService { /** * Get a list the names of all Race courses with races on today. * @return List of race course names */ List<Racecourse> getTodaysRaces(); /** * Get a list of all Horses running in a particular race. * @param race Name of race course * @return List of the names of all horses running in the specified race */ List<Horse> getHorsesInRace(String raceCourseId); /** * Get current odds for a particular horse in a specific race today. * @param race Name of race course * @param horse Name of horse * @return Current odds as a string (e.g. "10/1") */ String getOddsForHorse(String raceCourseId, String horseId); }
The service exposes 2 domain objects – Racecourse and Horse:
Racecourse.java
package com.cor.hysterix.domain; import org.apache.commons.lang.builder.EqualsBuilder; import org.apache.commons.lang.builder.HashCodeBuilder; public class Racecourse { private String id; private String name; public Racecourse(String id, String name) { super(); this.id = id; this.name = name; } public String getId() { return id; } public void setId(String id) { this.id = id; } public String getName() { return name; } public void setName(String name) { this.name = name; } @Override public String toString() { return "Racecourse [id=" + id + ", name=" + name + "]"; } @Override public int hashCode() { return HashCodeBuilder.reflectionHashCode(this); } @Override public boolean equals(Object obj) { return EqualsBuilder.reflectionEquals(this, obj); } }
Horse.java
package com.cor.hysterix.domain; import org.apache.commons.lang.builder.EqualsBuilder; import org.apache.commons.lang.builder.HashCodeBuilder; public class Horse { private String id; private String name; public Horse(String id, String name) { super(); this.id = id; this.name = name; } public String getId() { return id; } public void setId(String id) { this.id = id; } public String getName() { return name; } public void setName(String name) { this.name = name; } @Override public String toString() { return "Horse [id=" + id + ", name=" + name + "]"; } @Override public int hashCode() { return HashCodeBuilder.reflectionHashCode(this); } @Override public boolean equals(Object obj) { return EqualsBuilder.reflectionEquals(this, obj); } }
Hysterix’s Approach
Handling Failures
When an application is highly dependent on other services there is a danger that any failure in those systems could have a detrimental effect on our own calling application. As developers of the calling application, the root causes of these problems are largely out of our control, but how we deal with them is critical to the efficient running of our own systems.
Hysterix is largely built around using the Command design pattern. Each remote service call is wrapped in a HysterixCommand (for synchronous calls), or a HysterixObserveableCommand (for asynchronous calls)
Simple Failures
In the simplest of scenarios – the case of the remote service being unavailable, or throwing an exception, the HysterixCommand lets you define what action you wish to take in the getFallback() method (this may be to propogate the exception, return a default setting, or a special error value).
Circuit Breaker – Failing Fast
Calls to a remote system can be costly – even if the system returns an error, there may be a period of latency waiting for the error to be returned. This can introduce latency into the calling system, and also hog resources – possibly even using up all the threads in a thread pool (see more on this below).
Hystrix introduces the concept of a Circuit Breaker to facilitate failing fast. If it detects a number of similar failures in rapid succession (some combination of volume and error theshold) – it can trip a Circuit Breaker, forcing all subsequent calls to fail fast without making remote calls to the remote server (calling through the HystricCommands getFallback() method as described above). After some configurable period it will close the circuit again and begin processing remote calls (again, if it still detects an error state, it will re-open the circuit breaker, and so-on). This avoids us blindly calling the remote system when we know there is obviously an issue with it – removing latency and allowing us to deal with it in a pre-defined method – which may include returning cached data, or failing over to a backup system.
Latency and Thread Pools
As touched on above, a much more insidious side of failures (or even successful, but slow calls) is the potential to use up all the threads in the calling pool. As each call blocks a thread while waiting for a response from the remote system, new calls use up more threads from the pool. If that pool is not isolated from the main application server pool – it has the potential to make the whole calling application grind to a halt.
Hystrix allows use of its own thread pool(s) – so the calling applications threads are never in danger of being used up, allowing it to shut down calls that are taking too long. Different commands, or groups of commands, can be configured with their own thread pools, so it is possible to isolate different sets of service calls (e.g. if remote service A has latency issues, there is no leakage of impact into calls to remote service B). This helps isolate and manage calls across different client libraries and networks.
However, there is an overhead to using thread pools. Netflix did their own metrics (with 10+ billion Hystrix Command executions per day using thread isolation), and concluded that the overhead on their systems was so small the isolation benefits outweighed it.
The architecture used by Hystrix is illustrated in the diagram below. Each dependency is isolated from one other, restricted in the resources it can saturate when latency occurs, and covered in fallback logic that decides what response to make when any type of failure occurs in the dependency:
Coding Client Calls Efficiently
The flipside to responding to failures is coding remote calls as efficiently as possible to try and minimise latency/load issues.
Again, Hystrix provides some tools in the library to help deal with this.
Request Collapsing
If you imagine multiple uses on our client application using the system concurrently, it is likely that some users actions will cause similar requests to be sent to a remote service almost simultaneously (e.g. get stock price for Apple Inc NASDAQ: AAPL). Without any intervention – this would result in virtually duplicate requests being sent across the network in almost real time.
Request collapsing introduces a small wait period between when the remote service call is requested, and when it is executed – to see if any duplicate requests are made in the window (this window is usually very small – e.g. 10ms). There is the option to use ‘Global’context (i.e. collapse all requests for any user on any Tomcat thread), or ‘User Request’ context (i.e. collapse all requests for a single Tomcat thread).
Although a useful tool – request collapsing is really only benficial in scenarios where a particular command is called with great frequency, allowing it to batch tens or even hundreds of calls together, reducing threads and network load. This is especially true for high latency calls, where the overhead of waiting a few milliseconds for the collapsing window makes no real impact on the overall response time of the call.
For infrequent, or low latency required commands, collapsings cost will outweigh its benefits.
Request Caching
Request caching helps ensure that different threads don’t make duplicate/redundant calls to the external service. Because the cache sits in front of the construct/run methods it means that Hystrix can return cached results before creating and executing a new thread for each duplicate request. Caching can easily be added to a HystrixCommand or HystrixObserveableCommand by simply overriding the getCacheKey() method (this will provide the key Hystrix uses to determine if a subsequent request to the Command is a duplicate of an earlier one).
Examples from the Betting Service
Calling ‘getTodaysRaces’
This is a call a client can make on the Betting Service to get a list of all races available that day. Just the simplest example of extending a HystrixCommand to access a remote service (see Unit Tests below for examples of how to call it).
CommandGetTodaysRaces.java
package com.cor.hysterix.command; import java.util.ArrayList; import java.util.List; import com.cor.hysterix.domain.Racecourse; import com.cor.hysterix.exception.RemoteServiceException; import com.cor.hysterix.service.BettingService; import com.netflix.hystrix.HystrixCommand; import com.netflix.hystrix.HystrixCommandGroupKey; import com.netflix.hystrix.HystrixThreadPoolKey; /** * Get a list of all Race courses with races on today. * */ public class CommandGetTodaysRaces extends HystrixCommand<List<Racecourse>> { private final BettingService service; private final boolean failSilently; /** * CommandGetTodaysRaces * * @param service * Remote Broker Service * @param failSilently * If <code>true</code> will return an empty list if a remote service exception is thrown, if * <code>false</code> will throw a BettingServiceException. */ public CommandGetTodaysRaces(BettingService service, boolean failSilently) { super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("BettingServiceGroup")) .andThreadPoolKey( HystrixThreadPoolKey.Factory.asKey("BettingServicePool"))); this.service = service; this.failSilently = failSilently; } public CommandGetTodaysRaces(BettingService service) { this(service, true); } @Override protected List<Racecourse> run() { return service.getTodaysRaces(); } @Override protected List<Racecourse> getFallback() { // can log here, throw exception or return default if (failSilently) { return new ArrayList<Racecourse>(); } else { throw new RemoteServiceException("Unexpected error retrieving todays races"); } } }
Calling ‘GetHorsesInRaceWithCaching’
This is a call a client can make on the Betting Service to get a list of all horses running in a particular race. I have used this as an example of how to implement caching by overriding the getCacheKey() method on HystrixCommand. In this example we are using the raceId as the cache key. Details of the cache, expiry etc can be configured in detail (but that is beyond the scope of this particular article).
CommandGetHorsesInRaceWithCaching.java
package com.cor.hysterix.command.caching; import java.util.ArrayList; import java.util.List; import com.cor.hysterix.domain.Horse; import com.cor.hysterix.service.BettingService; import com.netflix.hystrix.HystrixCommand; import com.netflix.hystrix.HystrixCommandGroupKey; import com.netflix.hystrix.HystrixThreadPoolKey; /** * Get List of Horses in a specific race. Note, calls via this command are cached. * */ public class CommandGetHorsesInRaceWithCaching extends HystrixCommand<List<Horse>> { private final BettingService service; private final String raceCourseId; /** * CommandGetHorsesInRaceWithCaching. * @param service * Remote Broker Service * @param raceCourseId * Id of race course */ public CommandGetHorsesInRaceWithCaching(BettingService service, String raceCourseId) { super(Setter.withGroupKey( HystrixCommandGroupKey.Factory.asKey("BettingServiceGroup")) .andThreadPoolKey( HystrixThreadPoolKey.Factory.asKey("BettingServicePool"))); this.service = service; this.raceCourseId = raceCourseId; } @Override protected List<Horse> run() { return service.getHorsesInRace(raceCourseId); } @Override protected List<Horse> getFallback() { // can log here, throw exception or return default return new ArrayList<Horse>(); } @Override protected String getCacheKey() { return raceCourseId; } }
Here are some unit tests illustrating the use of this command in a variety of ways:
- testSynchronous() – shows a basic synchronous call to use the command
- testSynchronousFailSilently() – shows a basic synchronous call which captures an Exception from the remote service, swallows it and returns an empty list
- testSynchronousFailFast() – shows a basic synchronous call which captures an Exception from the remote service, and throws a new RemoteServiceException
- testAsynchronous() – shows a basic asynchronous call using Futures to use the same command
- testObservable() – shows a basic asynchronous call using Observables to use the same command
- testWithCacheHits() – illustrates how the Command caches the response from the server, thus reducing unnecessary remote calls
package com.cor.hysterix; import static org.junit.Assert.*; import static org.mockito.Mockito.mock; import static org.mockito.Mockito.when; import static org.mockito.Mockito.verify; import static org.mockito.Mockito.verifyNoMoreInteractions; import java.util.ArrayList; import java.util.Arrays; import java.util.List; import java.util.concurrent.Future; import org.junit.Before; import org.junit.Test; import rx.Observable; import com.cor.hysterix.command.CommandGetTodaysRaces; import com.cor.hysterix.command.batching.CommandCollapserGetOddsForHorse; import com.cor.hysterix.command.batching.GetOddsForHorseRequest; import com.cor.hysterix.command.caching.CommandGetHorsesInRaceWithCaching; import com.cor.hysterix.domain.Horse; import com.cor.hysterix.domain.Racecourse; import com.cor.hysterix.exception.RemoteServiceException; import com.cor.hysterix.service.BettingService; import com.netflix.hystrix.HystrixCommand; import com.netflix.hystrix.HystrixCommandKey; import com.netflix.hystrix.HystrixEventType; import com.netflix.hystrix.HystrixRequestCache; import com.netflix.hystrix.HystrixRequestLog; import com.netflix.hystrix.exception.HystrixRuntimeException; import com.netflix.hystrix.strategy.concurrency.HystrixConcurrencyStrategyDefault; import com.netflix.hystrix.strategy.concurrency.HystrixRequestContext; /** * Unit Test Suite that illustrates various approaches in using Hystrix Commands to access remote services. * * Uses Mockito to wire in a mock remote betting service. * */ public class BettingServiceTest { private static final String RACE_1 = "course_aintree"; private static final String HORSE_1 = "horse_redrum"; private static final String HORSE_2 = "horse_shergar"; private static final String ODDS_RACE_1_HORSE_1 = "10/1"; private static final String ODDS_RACE_1_HORSE_2 = "100/1"; private static final HystrixCommandKey GETTER_KEY = HystrixCommandKey.Factory.asKey("GetterCommand"); private BettingService mockService; /** * Set up the shared Unit Test environment */ @Before public void setUp() { mockService = mock(BettingService.class); when(mockService.getTodaysRaces()).thenReturn(getRaceCourses()); when(mockService.getHorsesInRace(RACE_1)).thenReturn(getHorsesAtAintree()); when(mockService.getOddsForHorse(RACE_1, HORSE_1)).thenReturn(ODDS_RACE_1_HORSE_1); when(mockService.getOddsForHorse(RACE_1, HORSE_2)).thenReturn(ODDS_RACE_1_HORSE_2); } /** * Command GetRaces - Execute (synchronous call). */ @Test public void testSynchronous() { CommandGetTodaysRaces commandGetRaces = new CommandGetTodaysRaces(mockService); assertEquals(getRaceCourses(), commandGetRaces.execute()); verify(mockService).getTodaysRaces(); verifyNoMoreInteractions(mockService); } /** * Command GetRaces - Execute and Fail Silently. * Swallows remote server error and returns an empty list. */ @Test public void testSynchronousFailSilently() { CommandGetTodaysRaces commandGetRacesFailure = new CommandGetTodaysRaces(mockService); // override mock to mimic an error being thrown for this test when(mockService.getTodaysRaces()).thenThrow(new RuntimeException("Error!!")); assertEquals(new ArrayList<Racecourse>(), commandGetRacesFailure.execute()); // Verify verify(mockService).getTodaysRaces(); verifyNoMoreInteractions(mockService); } /** * Command GetRaces - Execute and Fail Fast. * Catches remote server error and throws a new Exception. */ @Test public void testSynchronousFailFast() { CommandGetTodaysRaces commandGetRacesFailure = new CommandGetTodaysRaces(mockService, false); // override mock to mimic an error being thrown for this test when(mockService.getTodaysRaces()).thenThrow(new RuntimeException("Error!!")); try{ commandGetRacesFailure.execute(); }catch(HystrixRuntimeException hre){ assertEquals(RemoteServiceException.class, hre.getFallbackException().getClass()); } verify(mockService).getTodaysRaces(); verifyNoMoreInteractions(mockService); } /** * Command GetRaces - Queue (Asynchronous) */ @Test public void testAsynchronous() throws Exception { CommandGetTodaysRaces commandGetRaces = new CommandGetTodaysRaces(mockService); Future<List<Racecourse>> future = commandGetRaces.queue(); assertEquals(getRaceCourses(), future.get()); verify(mockService).getTodaysRaces(); verifyNoMoreInteractions(mockService); } /** * Command - Observe (Hot Observable) */ @Test public void testObservable() throws Exception { CommandGetTodaysRaces commandGetRaces = new CommandGetTodaysRaces(mockService); Observable<List<Racecourse>> observable = commandGetRaces.observe(); // blocking observable assertEquals(getRaceCourses(), observable.toBlocking().single()); verify(mockService).getTodaysRaces(); verifyNoMoreInteractions(mockService); } /** * Test - GetHorsesInRace - Uses Caching */ @Test public void testWithCacheHits() { HystrixRequestContext context = HystrixRequestContext.initializeContext(); try { CommandGetHorsesInRaceWithCaching commandFirst = new CommandGetHorsesInRaceWithCaching(mockService, RACE_1); CommandGetHorsesInRaceWithCaching commandSecond = new CommandGetHorsesInRaceWithCaching(mockService, RACE_1); commandFirst.execute(); // this is the first time we've executed this command with // the value of "2" so it should not be from cache assertFalse(commandFirst.isResponseFromCache()); verify(mockService).getHorsesInRace(RACE_1); verifyNoMoreInteractions(mockService); commandSecond.execute(); // this is the second time we've executed this command with // the same value so it should return from cache assertTrue(commandSecond.isResponseFromCache()); } finally { context.shutdown(); } // start a new request context context = HystrixRequestContext.initializeContext(); try { CommandGetHorsesInRaceWithCaching commandThree = new CommandGetHorsesInRaceWithCaching(mockService, RACE_1); commandThree.execute(); // this is a new request context so this // should not come from cache assertFalse(commandThree.isResponseFromCache()); // Flush the cache HystrixRequestCache.getInstance(GETTER_KEY, HystrixConcurrencyStrategyDefault.getInstance()).clear(RACE_1); } finally { context.shutdown(); } } private List<Racecourse> getRaceCourses(){ Racecourse course1 = new Racecourse(RACE_1, "Aintree"); return Arrays.asList(course1); } private List<Horse> getHorsesAtAintree(){ Horse horse1 = new Horse(HORSE_1, "Red Rum"); Horse horse2 = new Horse(HORSE_2, "Shergar"); return Arrays.asList(horse1, horse2); } // more tests here - see below }
Calling ‘GetOddsForHorse’
Now this last command is a bit more advanced – utilising batching and request collapsing. As this may be called repeatedly in rapid succession – and we aren’t too concerned if their is some very minor latency in the response – we don’t want to flood the service with similar requests. Using collapsing will introduce a very slight (millseconds) delay after a command is executed so it can wait to see if the same request comes through again. If it does it can make a single call to the service and return the same result to all requesters. Again, the details of this are configurable.
The classes below show how this is done for our particular example, and their is also a Unit Test which exercises this and confirms it is collapsing requests.
GetOddsForHorseRequest.java
package com.cor.hysterix.command.batching; /** * Object to wrap the parameters required by the method call {@link BettingService#getOddsForHorse}. */ public class GetOddsForHorseRequest { private final String raceCourseId; private final String horseId; public GetOddsForHorseRequest(String raceCourseId, String horseId){ this.raceCourseId = raceCourseId; this.horseId = horseId; } public String getRaceCourseId() { return raceCourseId; } public String getHorseId() { return horseId; } }
BatchCommandGetOddsForHorse.java
package com.cor.hysterix.command.batching; import java.util.ArrayList; import java.util.Collection; import java.util.List; import com.cor.hysterix.service.BettingService; import com.netflix.hystrix.HystrixCommand; import com.netflix.hystrix.HystrixCommandGroupKey; import com.netflix.hystrix.HystrixThreadPoolKey; import com.netflix.hystrix.HystrixCollapser.CollapsedRequest; public class BatchCommandGetOddsForHorse extends HystrixCommand<List<String>> { private final Collection<CollapsedRequest<String, GetOddsForHorseRequest>> requests; private BettingService service; public BatchCommandGetOddsForHorse(Collection<CollapsedRequest<String, GetOddsForHorseRequest>> requests) { super(Setter.withGroupKey( HystrixCommandGroupKey.Factory.asKey("BettingServiceGroup")) .andThreadPoolKey( HystrixThreadPoolKey.Factory.asKey("GetOddsPool"))); this.requests = requests; } @Override protected List<String> run() { ArrayList<String> response = new ArrayList<String>(); for (CollapsedRequest<String, GetOddsForHorseRequest> request : requests) { GetOddsForHorseRequest callRequest = request.getArgument(); response.add(service.getOddsForHorse(callRequest.getRaceCourseId(), callRequest.getHorseId())); } return response; } public void setService(BettingService service) { this.service = service; } }
CommandCollapserGetOddsForHorse.java
package com.cor.hysterix.command.batching; import java.util.Collection; import java.util.List; import com.cor.hysterix.service.BettingService; import com.netflix.hystrix.HystrixCollapser; import com.netflix.hystrix.HystrixCommand; public class CommandCollapserGetOddsForHorse extends HystrixCollapser<List<String>, String, GetOddsForHorseRequest> { private final GetOddsForHorseRequest key; private BettingService service; public CommandCollapserGetOddsForHorse(GetOddsForHorseRequest key) { this.key = key; } @Override public GetOddsForHorseRequest getRequestArgument() { return key; } @Override protected HystrixCommand<List<String>> createCommand(final Collection<CollapsedRequest<String, GetOddsForHorseRequest>> requests) { BatchCommandGetOddsForHorse command = new BatchCommandGetOddsForHorse(requests); command.setService(service); return command; } @Override protected void mapResponseToRequests(List<String> batchResponse, Collection<CollapsedRequest<String, GetOddsForHorseRequest>> requests) { int count = 0; for (CollapsedRequest<String, GetOddsForHorseRequest> request : requests) { request.setResponse(batchResponse.get(count++)); } } public void setService(BettingService service) { this.service = service; } }
Unit Test ‘CommandCollapserGetOddsForHorse’
The code below shows the Unit Test to check that requests are being collapsed when executing this particular command.
/** * Request Collapsing */ @SuppressWarnings("deprecation") @Test public void testCollapser() throws Exception { HystrixRequestContext context = HystrixRequestContext.initializeContext(); CommandCollapserGetOddsForHorse c1 = new CommandCollapserGetOddsForHorse(new GetOddsForHorseRequest(RACE_1, HORSE_1)); CommandCollapserGetOddsForHorse c2 = new CommandCollapserGetOddsForHorse(new GetOddsForHorseRequest(RACE_1, HORSE_1)); CommandCollapserGetOddsForHorse c3 = new CommandCollapserGetOddsForHorse(new GetOddsForHorseRequest(RACE_1, HORSE_1)); CommandCollapserGetOddsForHorse c4 = new CommandCollapserGetOddsForHorse(new GetOddsForHorseRequest(RACE_1, HORSE_1)); c1.setService(mockService); c2.setService(mockService); c3.setService(mockService); c4.setService(mockService); try { Future<String> f1 = c1.queue(); Future<String> f2 = c2.queue(); Future<String> f3 = c3.queue(); Future<String> f4 = c4.queue(); assertEquals(ODDS_RACE_1_HORSE_1, f1.get()); assertEquals(ODDS_RACE_1_HORSE_1, f2.get()); assertEquals(ODDS_RACE_1_HORSE_1, f3.get()); assertEquals(ODDS_RACE_1_HORSE_1, f4.get()); // assert that the batch command 'BatchCommandGetOddsForHorse' was in fact // executed and that it executed only once assertEquals(1, HystrixRequestLog.getCurrentRequest().getExecutedCommands().size()); HystrixCommand<?> command = HystrixRequestLog.getCurrentRequest().getExecutedCommands() .toArray(new HystrixCommand<?>[1])[0]; // assert the command is the one we're expecting assertEquals("BatchCommandGetOddsForHorse", command.getCommandKey().name()); // confirm that it was a COLLAPSED command execution assertTrue(command.getExecutionEvents().contains(HystrixEventType.COLLAPSED)); // and that it was successful assertTrue(command.getExecutionEvents().contains(HystrixEventType.SUCCESS)); } finally { context.shutdown(); } }
Conclusion
If you are doing any kind of work where you are accessing remote services – I’d definitely recommend checking Hystrix out. At the most basic level – it forces you to think about all the problems that can occur when accessing services early in the development phase – when solutions are most easy to implement. All too often these calls are developed in an ‘ideal world’ environment, with problems only becoming apparent when they move into user acceptance testing, or, worse still, production.
This article is just an attempt to try out a few of the approached used by Hystrix – and give a working project to help get started with playing around with it. Check it out on Github – run the Unit Test to exercise the mocked service calls.
Hystrix itself is open source – and can be found on Github – with a wiki and more detailed explanations, here: https://github.com/Netflix/Hystrix/
So just…