"Resilience não é sobre evitar falhas, é sobre falhar graciosamente e se recuperar rapidamente." — #30DiasJava Resilience Notes
🎯 Day 17 Goal
In distributed systems, failures are inevitable. Day 17 of #30DiasJava implemented a full set of resilience patterns with Resilience4j: circuit breakers, retry, timeout, rate limiting, and bulkhead, so that the system keeps working even when external services fail.
🛠️ What was implemented
✅ Circuit Breaker Pattern
- Resilience4j integration: Automatic circuit breakers for external calls
- Three states: CLOSED (normal), OPEN (failing, calls are rejected), HALF_OPEN (testing recovery)
- Failure threshold: Opens when the failure rate over the sliding window reaches the configured threshold (failureRateThreshold)
- Wait duration: Time spent in OPEN before trying HALF_OPEN
- Half-open trial calls: Number of calls permitted in HALF_OPEN (permittedNumberOfCallsInHalfOpenState) before the breaker closes or reopens
- Metrics: State, transitions, recorded failures
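To make the state transitions listed above concrete, here is a minimal plain-Java sketch of a circuit breaker. It is not Resilience4j's implementation: the class SimpleCircuitBreaker, its constructor parameters, and the injected clock are hypothetical, and it evaluates calls in fixed batches rather than a true sliding window.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, simplified circuit breaker; not Resilience4j's implementation. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;              // calls evaluated per batch while CLOSED
    private final double failureRateThreshold; // percentage, e.g. 50.0
    private final long waitMillis;             // time to stay OPEN
    private final int halfOpenCalls;           // trial calls while HALF_OPEN
    private final LongSupplier clock;          // injected so time can be controlled

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         long waitMillis, int halfOpenCalls, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitMillis = waitMillis;
        this.halfOpenCalls = halfOpenCalls;
        this.clock = clock;
    }

    /** Current state; moves OPEN -> HALF_OPEN once the wait duration has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= waitMillis) {
            state = State.HALF_OPEN;
            calls = 0;
            failures = 0;
        }
        return state;
    }

    /** Runs the action unless the circuit is OPEN; failures and rejections use the fallback. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    /** Counts the outcome and, once enough calls were seen, decides the next state. */
    private synchronized void record(boolean failed) {
        calls++;
        if (failed) {
            failures++;
        }
        int needed = (state == State.HALF_OPEN) ? halfOpenCalls : windowSize;
        if (calls < needed) {
            return;
        }
        if (failures * 100.0 / calls >= failureRateThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        } else {
            state = State.CLOSED;
        }
        calls = 0;
        failures = 0;
    }
}
```

Injecting the clock as a LongSupplier keeps the sketch deterministic: a caller can advance time by hand instead of sleeping through the wait duration.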
✅ Retry Pattern
- Exponential backoff: Retries with exponentially increasing delays
- Max attempts: Maximum number of attempts
- Retry conditions: Retries only on specific exceptions
- Jitter: Randomizes the delay to avoid a thundering herd
- Metrics: Attempts, successes, failures
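The backoff and jitter arithmetic behind the list above can be sketched in plain Java. This is an illustration under stated assumptions, not Resilience4j's code: RetrySketch and its method names are hypothetical, and UncheckedIOException stands in for whichever exceptions are configured as retryable.

```java
import java.io.UncheckedIOException;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

/** Hypothetical retry helper illustrating exponential backoff with jitter. */
final class RetrySketch {

    /** Delay before the next attempt: base * multiplier^(attempt - 1). */
    static long backoffMillis(long baseMillis, double multiplier, int attempt) {
        return Math.round(baseMillis * Math.pow(multiplier, attempt - 1));
    }

    /** Randomizes the delay by up to +/- (delay * factor). */
    static long withJitter(long delayMillis, double factor) {
        double spread = delayMillis * factor;
        if (spread <= 0) {
            return delayMillis;
        }
        long offset = Math.round(ThreadLocalRandom.current().nextDouble(-spread, spread));
        return Math.max(0, delayMillis + offset);
    }

    /** Retries only on UncheckedIOException; any other exception propagates immediately. */
    static <T> T retry(Supplier<T> action, int maxAttempts, long baseMillis, double multiplier) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        UncheckedIOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (UncheckedIOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(withJitter(backoffMillis(baseMillis, multiplier, attempt), 0.5));
                }
            }
        }
        throw last; // all attempts failed
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while backing off", e);
        }
    }
}
```

With a base delay of 1000 ms and a multiplier of 2, the un-jittered waits after the first and second failed attempts are 1000 ms and 2000 ms; the jitter spreads concurrent clients around those values so they do not all retry at the same instant.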
✅ Timeout Pattern
- Call timeout: Timeout per individual call
- Global timeout: Global timeout for operations
- Timeout exceptions: Dedicated exceptions for timeouts
- Metrics: Recorded timeouts, average response time
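A per-call timeout can be illustrated with the JDK's CompletableFuture alone. The sketch below is an assumption-laden example rather than how Resilience4j's TimeLimiter works internally; TimeoutSketch, callWithTimeout, and slowCall are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helper: bound how long a caller waits for a result. */
final class TimeoutSketch {

    /** Returns the action's result, or the fallback if it does not finish in time. */
    static <T> T callWithTimeout(Supplier<T> action, long timeoutMillis, T fallback) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(action);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; CompletableFuture does not interrupt
            // the thread that is still running the task.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Call failed", e.getCause());
        }
    }

    /** Simulates a slow dependency by sleeping before answering. */
    static String slowCall(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "slow";
    }
}
```

For example, callWithTimeout(() -> TimeoutSketch.slowCall(2000), 100, "timeout") gives up after roughly 100 ms and returns the fallback, while the slow task keeps running in the background until it finishes on its own.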
✅ Rate Limiting
- Request rate limiting: Limit on requests per period
- Thread pool limiting: Limit on concurrent threads
- Dynamic limits: Limits adjustable at runtime
- Metrics: Permitted and blocked requests, current rate
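The idea of a fixed number of permits per refresh period can be sketched as follows. SimpleRateLimiter is a hypothetical class written for illustration, with parameter names chosen to echo limitForPeriod and limitRefreshPeriod; it rejects immediately when no permit is left, and it is not how Resilience4j implements its limiter.

```java
import java.util.function.LongSupplier;

/** Hypothetical fixed-window rate limiter: N permits per refresh period. */
final class SimpleRateLimiter {
    private final int limitForPeriod;
    private final long refreshPeriodMillis;
    private final LongSupplier clock; // injected so time can be controlled

    private long periodStart;
    private int used;

    SimpleRateLimiter(int limitForPeriod, long refreshPeriodMillis, LongSupplier clock) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodMillis = refreshPeriodMillis;
        this.clock = clock;
        this.periodStart = clock.getAsLong();
    }

    /** Takes a permit if one is left in the current period; never blocks. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (now - periodStart >= refreshPeriodMillis) {
            // A new period has started: hand out a fresh set of permits.
            periodStart = now;
            used = 0;
        }
        if (used >= limitForPeriod) {
            return false;
        }
        used++;
        return true;
    }
}
```

A caller would typically check tryAcquire() before issuing the request and return an error such as HTTP 429 when it yields false.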
✅ Bulkhead Pattern
- Thread pool isolation: Separate thread pools per service
- Semaphore isolation: Semaphores to cap concurrency
- Resource isolation: Resources isolated per context
- Metrics: Active threads, waiting requests
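Semaphore isolation, the simplest of the variants listed above, fits in a few lines of plain Java. SemaphoreBulkhead below is a hypothetical sketch, assuming callers should be rejected immediately (no waiting) when all permits are taken; it is not Resilience4j's Bulkhead class.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical semaphore-based bulkhead: caps the number of concurrent calls. */
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a permit is free; otherwise returns the rejection result at once. */
    <T> T call(Supplier<T> action, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get();
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always give the permit back, even on failure
        }
    }

    /** Number of calls that could start right now. */
    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Giving each downstream service its own SemaphoreBulkhead instance means a slow dependency can exhaust only its own permits, leaving capacity for calls to the others.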
📊 Architecture
┌─────────────────────────────────────────────────────────┐
│                     Client Service                      │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│             Resilience Layer (Resilience4j)             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │ Circuit      │  │ Retry        │  │ Timeout      │   │
│  │ Breaker      │  │ Pattern      │  │ Pattern      │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
│  ┌──────────────┐  ┌──────────────┐                     │
│  │ Rate Limiter │  │ Bulkhead     │                     │
│  └──────────────┘  └──────────────┘                     │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│               External Service / Database               │
└─────────────────────────────────────────────────────────┘
💻 Implementation
Dependencies (pom.xml)
<!-- Resilience4j -->
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-micrometer</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-reactor</artifactId>
    <version>2.1.0</version>
</dependency>
Configuration (application.yml)
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        registerHealthIndicator: true
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 10s
        failureRateThreshold: 50
        slowCallRateThreshold: 100
        slowCallDurationThreshold: 2s
      notificationService:
        registerHealthIndicator: true
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        waitDurationInOpenState: 5s
        failureRateThreshold: 50
  retry:
    instances:
      paymentService:
        maxAttempts: 3
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        retryExceptions:
          - java.net.ConnectException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - com.example.BusinessException
  timelimiter:
    instances:
      paymentService:
        timeoutDuration: 3s
        cancelRunningFuture: true
  ratelimiter:
    instances:
      paymentService:
        limitForPeriod: 10
        limitRefreshPeriod: 1s
        timeoutDuration: 0
        subscribeToEvents: true
        registerHealthIndicator: true
  bulkhead:
    instances:
      paymentService:
        maxConcurrentCalls: 10
        maxWaitDuration: 0
Circuit Breaker Service
import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.ratelimiter.annotation.RateLimiter;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import java.util.concurrent.CompletableFuture;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
@Slf4j
public class PaymentService {

    private final RestTemplate restTemplate;

    // The annotations below look up their "paymentService" instances by name in the
    // Spring-managed registries, so the registries do not need to be injected here.
    // (Fields typed CircuitBreaker, Retry, etc. would also clash with the annotation names.)
    public PaymentService(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    @CircuitBreaker(name = "paymentService", fallbackMethod = "processPaymentFallback")
    @Retry(name = "paymentService")
    @TimeLimiter(name = "paymentService")
    @RateLimiter(name = "paymentService")
    @Bulkhead(name = "paymentService")
    public CompletableFuture<PaymentResponse> processPayment(PaymentRequest request) {
        log.info("Processing payment: {}", request);
        // @TimeLimiter requires an asynchronous return type such as CompletableFuture
        return CompletableFuture.supplyAsync(() -> {
            ResponseEntity<PaymentResponse> response = restTemplate.postForEntity(
                "http://payment-service/api/payments",
                request,
                PaymentResponse.class
            );
            return response.getBody();
        });
    }

    // Same parameters as processPayment plus the exception, and the same return type
    public CompletableFuture<PaymentResponse> processPaymentFallback(
            PaymentRequest request,
            Exception ex) {
        log.error("Payment service unavailable, using fallback", ex);
        // Fallback: Queue payment for later processing
        return CompletableFuture.completedFuture(
            PaymentResponse.builder()
                .status("QUEUED")
                .message("Payment queued for processing")
                .build()
        );
    }
}
Resilience Decorator (Manual)
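import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryRegistry;
import java.util.function.Supplier;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
@Slf4j
public class NotificationService {

    private final RestTemplate restTemplate;
    private final CircuitBreaker circuitBreaker;
    private final Retry retry;

    public NotificationService(
            RestTemplate restTemplate,
            CircuitBreakerRegistry circuitBreakerRegistry,
            RetryRegistry retryRegistry) {
        this.restTemplate = restTemplate;
        this.circuitBreaker = circuitBreakerRegistry.circuitBreaker("notificationService");
        this.retry = retryRegistry.retry("notificationService");
    }

    public void sendNotification(NotificationRequest request) {
        // Innermost layer: the actual HTTP call
        Supplier<String> call = () -> {
            log.info("Sending notification: {}", request);
            ResponseEntity<String> response = restTemplate.postForEntity(
                "http://notification-service/api/notifications",
                request,
                String.class
            );
            return response.getBody();
        };

        // Retry wraps the call and the circuit breaker wraps the retry, so the
        // breaker records one outcome per fully retried invocation.
        Supplier<String> decoratedSupplier = CircuitBreaker.decorateSupplier(
            circuitBreaker, Retry.decorateSupplier(retry, call));

        String result;
        try {
            result = decoratedSupplier.get();
        } catch (Exception e) {
            // Covers both an open circuit (call rejected) and exhausted retries.
            // Plain try/catch is used because Resilience4j 2.x no longer ships Vavr's Try.
            log.error("Notification failed, using fallback", e);
            result = "Notification queued";
        }
        log.info("Notification result: {}", result);
    }
}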
Metrics & Observability
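import io.github.resilience4j.bulkhead.BulkheadRegistry;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.micrometer.tagged.TaggedBulkheadMetrics;
import io.github.resilience4j.micrometer.tagged.TaggedCircuitBreakerMetrics;
import io.github.resilience4j.micrometer.tagged.TaggedRateLimiterMetrics;
import io.github.resilience4j.micrometer.tagged.TaggedRetryMetrics;
import io.github.resilience4j.ratelimiter.RateLimiterRegistry;
import io.github.resilience4j.retry.RetryRegistry;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Note: resilience4j-spring-boot3 normally binds these metrics automatically when
// Actuator and Micrometer are present; explicit binding is only needed without it.
@Configuration
public class ResilienceMetricsConfig {

    // The registries are injected as bean method parameters instead of being
    // obtained from undefined local factory methods.
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> resilienceMetricsCustomizer(
            CircuitBreakerRegistry circuitBreakerRegistry,
            RetryRegistry retryRegistry,
            RateLimiterRegistry rateLimiterRegistry,
            BulkheadRegistry bulkheadRegistry) {
        return registry -> {
            // Circuit Breaker Metrics
            TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(circuitBreakerRegistry).bindTo(registry);
            // Retry Metrics
            TaggedRetryMetrics.ofRetryRegistry(retryRegistry).bindTo(registry);
            // Rate Limiter Metrics
            TaggedRateLimiterMetrics.ofRateLimiterRegistry(rateLimiterRegistry).bindTo(registry);
            // Bulkhead Metrics
            TaggedBulkheadMetrics.ofBulkheadRegistry(bulkheadRegistry).bindTo(registry);
        };
    }
}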
Health Indicators
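import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class ResilienceHealthIndicator implements HealthIndicator {

    private final CircuitBreakerRegistry circuitBreakerRegistry;

    public ResilienceHealthIndicator(CircuitBreakerRegistry circuitBreakerRegistry) {
        this.circuitBreakerRegistry = circuitBreakerRegistry;
    }

    @Override
    public Health health() {
        // Health.up() sets an explicit status; a bare new Health.Builder() reports UNKNOWN
        Health.Builder builder = Health.up();
        // getAllCircuitBreakers() yields breakers, not (name, breaker) pairs,
        // so the name is read from each breaker
        for (CircuitBreaker circuitBreaker : circuitBreakerRegistry.getAllCircuitBreakers()) {
            String name = circuitBreaker.getName();
            builder.withDetail(name + ".state", circuitBreaker.getState().name());
            builder.withDetail(name + ".failureRate",
                circuitBreaker.getMetrics().getFailureRate());
        }
        return builder.build();
    }
}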
📈 Metrics and Observability
Prometheus Metrics
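# Illustrative sample values; exact metric and label names depend on the Resilience4j and Micrometer versions in use
# Circuit Breaker Metrics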
resilience4j_circuitbreaker_calls{name="paymentService",state="closed"} 100
resilience4j_circuitbreaker_calls{name="paymentService",state="open"} 5
resilience4j_circuitbreaker_calls{name="paymentService",state="half_open"} 2
resilience4j_circuitbreaker_failure_rate{name="paymentService"} 0.15
resilience4j_circuitbreaker_slow_calls{name="paymentService"} 3
# Retry Metrics
resilience4j_retry_calls{name="paymentService",result="successful"} 95
resilience4j_retry_calls{name="paymentService",result="failed"} 5
resilience4j_retry_retries{name="paymentService"} 12
# Rate Limiter Metrics
resilience4j_ratelimiter_available_permissions{name="paymentService"} 8
resilience4j_ratelimiter_waiting_threads{name="paymentService"} 2
# Bulkhead Metrics
resilience4j_bulkhead_available_concurrent_calls{name="paymentService"} 7
resilience4j_bulkhead_threads_waiting{name="paymentService"} 3
Grafana Dashboard
- Circuit Breaker State: States (CLOSED/OPEN/HALF_OPEN) per service
- Failure Rate: Failure rate per service
- Retry Attempts: Retry attempts per service
- Rate Limiter: Permitted/blocked requests
- Bulkhead: Active/waiting threads
🎓 Lessons Learned
✅ What worked well
- Circuit breakers prevent cascading failures: They isolate failures before they spread
- Retry with exponential backoff: Reduces load on overloaded services
- Timeouts avoid hanging requests: Requests do not stay stuck indefinitely
- Rate limiting protects services: Prevents request overload
- Bulkheads isolate resources: Failures in one service do not affect others
⚠️ Challenges and solutions
- Complex initial configuration: We created reusable configuration templates
- Metrics can be overwhelming: We filtered the dashboard down to the most important metrics
- Fallback strategies: We implemented fallbacks suited to each context
📚 Resources
- Resilience4j: https://resilience4j.readme.io/
- Circuit Breaker Pattern: https://martinfowler.com/bliki/CircuitBreaker.html
- Bulkhead Pattern: https://docs.microsoft.com/en-us/azure/architecture/patterns/bulkhead
- Retry Pattern: https://docs.microsoft.com/en-us/azure/architecture/patterns/retry
🔗 Links
Next episode → Day 18/30: Service Mesh & Istio (coming soon)