This post is somewhat of PSA about using the excellent open source Polly library for handling resiliency to your application. Recently, I was tasked with adding a circuit-breaker implementation to some code calling an external API, and I figured Polly would be perfect, especially as we already used it in our solution!
I hadn't used Polly directly in a little while, but the excellent design makes it easy to add retry handling, timeouts, or circuit-breaking to your application. Unfortunately, my initial implementation had one particular flaw, which meant that my circuit-breaker never actually worked!
In this post I'll outline the scenario I was working with, my initial implementation, the subsequent issues, and what I should have done!
tl;dr;
Policy
is thread safe, and for the circuit-breaker to work correctly, it must be shared so that you callExecute
on the samePolicy
instance every time!
The scenario - dealing with a flakey external API
A common requirement when working with currencies is dealing with exchange rates. We have happily been using the Open Exchange Rates API to fetch a JSON list of exchange rates for a while now.
The existing implementation consists of three classes:
OpenExchangeRatesClient
- Responsible for fetching the exchange rates from the API, and parsing the JSON into a strongly typed .NET object.OpenExchangeRatesCache
- We don't want to fetch exchange rates every time we need them, so this class caches the latest exchange rates for a day before calling theOpenExchangeRatesClient
to get up-to-date rates.FallbackExchangeRateProvider
- If the call to fetch the latest rates using theOpenExchangeRatesClient
fails, we fallback to a somewhat recent copy of the data, loaded from an embedded resource in the assembly.
All of these classes are registered as Singletons with the IoC container, so there isonly a single instance of each. This setup has been working fine for a while, but there was an issue where the Open Exchange Rates API went down, just as the local cache of exchange rates expired. The series of events was:
- A request was made to our internal API, which called a service that required exchange rates.
- The service called the
OpenExchangeRatesCache
which realised the current rates were out of date. - The cache called the
OpenExchangeRatesClient
to fetch the latest rates. - Unfortunately the service was down, and eventually caused a timeout (after 100 seconds!)
- At this point the cache used the
FallbackExchangeRateProvider
to use the stale rates for this single request. - A separate request was made to our internal API - repeat steps 2-6!
An issue in the external dependency, the exchange rate API going down, was causing our internal services to take 100s to respond to requests, which in turn was causing other requests to timeout. Effectively we had a cascading failure, even though we thought we had accounted for this by providing a fallback.
Note I realise updating cached exchange rates should probably be a background task. This would stop requests failing if there are issues updating, but the general problem is common to many scenarios, especially if you're using micro-services.
Luckily, this outage didn't happen at a peak time, so by the time we came to investigate the issue, the problem had passed, and relatively few people were affected. However, it obviously flagged up a problem, so I set about trying to ensure this wouldn't happen again if the API had issues at a later date!
Fix 1 - Reduce the timeouts
The first fix was a relatively simple one. The OpenExchangeRatesClient
was using an HttpClient
to call the API and fetch the exchange rate data. This was instantiated in the constructor, and reused for the lifetime of the class. As the client was used as a singleton, the HttpClient
was also a singleton (so we didn't have any of these issues).
public class OpenExchangeRatesClient
{
private readonly HttpClient _client;
OpenExchangeRatesClient(string apiUrl)
{
_client = new HttpClient
{
BaseAddress = new Uri(apiUrl),
};
}
}
The first fix I made was to set the Timeout
property on the HttpClient
. In the failure scenario, it was taking 100s to get back an error response. Why 100s? Because that's the default timeout for HttpClient
!
Checking our metrics of previous calls to the service, I could see that prior to the failure, virtually all calls were taking approximately 0.25s. Based on that, a 100s timeout was clearly overkill! Setting the timeout to something more modest, but still conservative, say 5s, should help prevent the scenario happening again.
public class OpenExchangeRatesClient
{
private readonly HttpClient _client;
OpenExchangeRatesClient(string apiUrl)
{
_client = new HttpClient
{
BaseAddress = new Uri(apiUrl),
Timeout = TimeSpan.FromSeconds(5),
};
}
}
Fix 2 - Add a circuit breaker
The second fix was to add a circuit-breaker implementation to the API calls. The Polly documentation has a great explanation of the circuit-breaker pattern, but I'll give a brief summary here.
Circuit-breakers in brief
Circuit-breakers make sense when calling a somewhat unreliable API. They use a fail-fast approach when a method has failed several times in a row. As an example, in my scenario, there was no point repeatedly calling the API when it hadn't worked several times in a row, and was very likely to fail. All we were doing was adding additional delays to the method calls, when it's pretty likely you're going to have to use the fallback anyway.
The circuit-breaker tracks the number of times an API call has failed. Once it crosses a threshold number of failures in a row, it doesn't even try to call the API for subsequent requests. Instead, it fails immediately, as though the API had failed.
After some timeout, the circuit-breaker will let one method call through to "test" the API and see if it succeeds. If it fails, it goes back to just failing immediately. If it succeeds then the circuit is closed again, and it will go back to calling the API for every request.
Circuit breaker state diagram taken from the Polly documentation
The circuit-breaker was a perfect fit for the failure scenario in our app, so I set about adding it to the OpenExchangeRatesClient
.
Creating a circuit breaker policy
You can create a circuit-breaker Policy
in Polly using the CircuitBreakerSyntax
. As we're going to be making requests with the HttpClient
, I used the async methods for setting up the policy and for calling the API:
var circuitBreaker = Policy
.Handle<Exception>()
.CircuitBreakerAsync(
exceptionsAllowedBeforeBreaking: 2,
durationOfBreak: TimeSpan.FromMinutes(1)
);
var rates = await circuitBreaker
.ExecuteAsync(() => CallRatesApi());
This configuration creates a new circuit breaker policy, defines the number of consecutive exceptions to allow before marking the API as broken and opening the breaker, and the amount of time the breaker should stay open for before moving to the half-closed state.
Once you have a policy in place, circuitBreaker
, you can call ExecuteAsync
and pass in the method to execute. At runtime, if an exception occurs executing CallRatesApi()
the circuit breaker will catch it, and keep track of how many exceptions it has raised to control the breaker's state.
Adding a fallback
When an exception occurs in the CallRatesApi()
method, the breaker will catch it, but it will re-throw the exception. In my case, I wanted to catch those exceptions and use the FallbackExchangeRateProvider
. I could have used a try-catch
block, but I decided to stay in the Polly-spirit and use a Fallback
policy.
A fallback policy is effectively a try catch block - it simply executes an alternative method if CallRatesApi()
throws. You can then wrap the fallback policy around the breaker policy to combine the two. If the circuit breaker fails, the fallback will run instead:
var circuitBreaker = Policy
.Handle<Exception>()
.CircuitBreakerAsync(
exceptionsAllowedBeforeBreaking: 2,
durationOfBreak: TimeSpan.FromMinutes(1)
);
var fallback = Policy
.Handle<Exception>()
.FallbackAsync(()=> GetFallbackRates())
.WrapAsync(circuitBreaker);
var results = await fallback
.ExecuteAsync(() => CallRatesApi());
Putting it together - my failed attempt!
This all looked like it would work as best I could see, so I set about replacing the OpenExchangeRatesClient
implementation, and testing it out.
**Note ** This isn't correct, don't copy it!
public class OpenExchangeRatesClient
{
private readonly HttpClient _client;
public OpenExchangeRatesClient(string apiUrl)
{
_client = new HttpClient
{
BaseAddress = new Uri(apiUrl),
};
}
public Task<ExchangeRates> GetLatestRates()
{
var circuitBreaker = Policy
.Handle<Exception>()
.CircuitBreakerAsync(
exceptionsAllowedBeforeBreaking: 2,
durationOfBreak: TimeSpan.FromMinutes(1)
);
var fallback = Policy
.Handle<Exception>()
.FallbackAsync(() => GetFallbackRates())
.WrapAsync(circuitBreaker);
return fallback
.ExecuteAsync(() => CallRatesApi());
}
public Task<ExchangeRates> CallRatesApi()
{
//call the API, parse the results
}
public Task<ExchangeRates> GetFallbackRates()
{
// load the rates from the embedded file and parse them
}
}
In theory, this is the flow I was aiming for when the API goes down:
- Call
GetLatestRates()
->CallRatesApi()
throws -> Uses Fallback - Call
GetLatestRates()
->CallRatesApi()
throws -> Uses Fallback - Call
GetLatestRates()
-> skipsCallRatesApi()
-> Uses Fallback - Call
GetLatestRates()
-> skipsCallRatesApi()
-> Uses Fallback - ... etc
What I actually saw was:
- Call
GetLatestRates()
->CallRatesApi()
throws -> Uses Fallback - Call
GetLatestRates()
->CallRatesApi()
throws -> Uses Fallback - Call
GetLatestRates()
->CallRatesApi()
throws -> Uses Fallback - Call
GetLatestRates()
->CallRatesApi()
throws -> Uses Fallback - ... etc
It was as though the circuit breaker wasn't there at all! No matter how many times the CallRatesApi()
method threw, the circuit was never breaking.
Can you see what I did wrong?
Using circuit breakers properly
Every time you call GetLatestRates()
, I'm creating a new circuit breaker (and fallback) policy, and then calling ExecuteAsync
on that!
The circuit breaker, by it's nature, has state that must be persisted between calls (the number of exceptions that have previously happened, the open/closed state of the breaker etc). By creating new Policy
objects inside the GetLatestRates()
method, I was effectively resetting the policy back to its initial state, hence why nothing was working!
The answer is simple - make sure the Policy persists between calls to GetLatestRates()
so that its state persists. The Policy
is thread safe, so there's no issues to worry about there either. As the client is implemented in our app as a singleton, I simply moved the policy configuration to the class constructor, and everything proceeded to work as expected!
public class OpenExchangeRatesClient
{
private readonly HttpClient _client;
private readonly Policy _policy;
public OpenExchangeRatesClient(string apiUrl)
{
_client = new HttpClient
{
BaseAddress = new Uri(apiUrl),
};
var circuitBreaker = Policy
.Handle<Exception>()
.CircuitBreakerAsync(
exceptionsAllowedBeforeBreaking: 2,
durationOfBreak: TimeSpan.FromMinutes(1)
);
_policy = Policy
.Handle<Exception>()
.FallbackAsync(() => GetFallbackRates())
.Wrap(circuitBreaker);
}
public Task<ExchangeRates> GetLatestRates()
{
return _policy
.ExecuteAsync(() => CallRatesApi());
}
public Task<ExchangeRates> CallRatesApi()
{
//call the API, parse the results
}
public Task<ExchangeRates> GetFallbackRates()
{
// load the rates from the embedded file and parse them
}
}
And that's all it takes! It works brilliantly when you actually use it properly 😉
Summary
This post ended up a lot longer than I intended, as it was a bit of a post-incident brain-dump, so apologies for that! It serves as somewhat of a cautionary tale about having blinkers on when coding something. When I implemented the fix initially I was so caught up in how I was solving the problem I completely overlooked this simple, but crucial, difference in how policies can be implemented.
I'm not entirely sure if in general it's best to use shared policies, or if it's better to create and discard policies as I did originally. Obviously, the latter doesn't work for circuit breaker but what about Retry
or WaitAndRetry
? Also, creating a new policy each time is probably more "allocate-y", but is it faster due to not having to be thread safe?
I don't know the answer, but personally, and based on this episode, I'm inclined to go with shared policies everywhere. If you know otherwise, do let me know in the comments, thanks!