"Resilience is not about avoiding failures; it is about failing gracefully and recovering quickly." - #30DiasJava Resilience Notes

🎯 Day 17 Goal

In distributed systems, failures are inevitable. Day 17 of #30DiasJava implements a full set of resilience patterns with Resilience4j (circuit breaker, retry, timeout, rate limiting, and bulkhead) so that the system keeps working even when external services fail.

🛠️ What was implemented

✅ Circuit Breaker Pattern

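  • Resilience4j integration: Automatic circuit breakers around external calls
  • Three states: CLOSED (normal operation), OPEN (calls rejected), HALF_OPEN (probing for recovery)
  • Failure threshold: Opens when the failure rate over the sliding window reaches the configured percentage (failureRateThreshold), not after a fixed number of consecutive failures
  • Wait duration: Time spent in OPEN before moving to HALF_OPEN
  • Half-open trial calls: Number of calls permitted in HALF_OPEN; their failure rate decides whether the breaker closes or reopens
  • Metrics: State, transitions, recorded failures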
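The state transitions listed above can be pictured with a small stdlib-only sketch. This is not how Resilience4j is implemented; the class `TinyCircuitBreaker`, its constructor parameters, and the injected clock are hypothetical names chosen for illustration, and concurrent trial calls in HALF_OPEN are not tracked.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Illustrative count-based circuit breaker; an assumption-laden sketch, not Resilience4j code. */
final class TinyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;              // outcomes remembered while CLOSED
    private final int minimumCalls;            // calls required before the rate is evaluated
    private final double failureRateThreshold; // percentage, e.g. 50.0
    private final long waitMillis;             // time spent OPEN before probing
    private final int halfOpenCalls;           // trial calls evaluated in HALF_OPEN
    private final LongSupplier clockMillis;    // injected so time can be controlled in tests

    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;

    TinyCircuitBreaker(int windowSize, int minimumCalls, double failureRateThreshold,
                       long waitMillis, int halfOpenCalls, LongSupplier clockMillis) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.failureRateThreshold = failureRateThreshold;
        this.waitMillis = waitMillis;
        this.halfOpenCalls = halfOpenCalls;
        this.clockMillis = clockMillis;
    }

    /** Returns the current state, moving OPEN -> HALF_OPEN once the wait has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clockMillis.getAsLong() - openedAt >= waitMillis) {
            state = State.HALF_OPEN;
            outcomes.clear();
        }
        return state;
    }

    /** Returns false while the breaker is OPEN, meaning the call should be rejected. */
    synchronized boolean tryAcquire() {
        return state() != State.OPEN;
    }

    /** Records one call outcome and applies the threshold to the current window. */
    synchronized void record(boolean failed) {
        State current = state();
        if (current == State.OPEN) {
            return; // late results while OPEN are ignored in this sketch
        }
        boolean halfOpen = current == State.HALF_OPEN;
        outcomes.addLast(failed);
        if (outcomes.size() > (halfOpen ? halfOpenCalls : windowSize)) {
            outcomes.removeFirst();
        }
        if (outcomes.size() < (halfOpen ? halfOpenCalls : minimumCalls)) {
            return; // not enough data yet
        }
        long failures = outcomes.stream().filter(f -> f).count();
        double rate = 100.0 * failures / outcomes.size();
        if (rate >= failureRateThreshold) {
            state = State.OPEN;
            openedAt = clockMillis.getAsLong();
            outcomes.clear();
        } else if (halfOpen) {
            state = State.CLOSED;
            outcomes.clear();
        }
    }
}
```

A caller would check `tryAcquire()` before the remote call and pass the outcome to `record(...)` afterwards. Injecting the clock keeps the wait-duration transition testable without sleeping.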

✅ Retry Pattern

  • Exponential backoff: The delay between retries grows exponentially
  • Max attempts: Maximum number of attempts, including the first call
  • Retry conditions: Retry only on specific exceptions
  • Jitter: Randomizes the delay to avoid a thundering herd
  • Metrics: Attempts, successes, failures
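To make the backoff and jitter arithmetic concrete, here is a minimal sketch using only the JDK. The class `BackoffSketch` and its method names are hypothetical; the formulas are the usual textbook ones (delay = initial × multiplier^(retry − 1), then a uniform spread around that value) and are not taken from Resilience4j's source.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative delay calculations for retry with exponential backoff and jitter. */
final class BackoffSketch {

    /** Base delay before retry number {@code retry} (1-based): initial * multiplier^(retry - 1). */
    static long exponentialDelayMillis(long initialMillis, double multiplier, int retry) {
        return Math.round(initialMillis * Math.pow(multiplier, retry - 1));
    }

    /** Spreads the delay uniformly over [base * (1 - factor), base * (1 + factor)]. */
    static long withJitter(long baseMillis, double factor) {
        double low = baseMillis * (1.0 - factor);
        double high = baseMillis * (1.0 + factor);
        return Math.round(low + ThreadLocalRandom.current().nextDouble() * (high - low));
    }
}
```

With an initial wait of 1s, a multiplier of 2, and three attempts in total, there are two waits: 1000 ms before the second attempt and 2000 ms before the third. A jitter factor of 0.5 would turn the first wait into a random value between 500 ms and 1500 ms, so that clients that failed together do not retry at the same instant.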

✅ Timeout Pattern

  • Call timeout: Timeout per individual call
  • Global timeout: Overall timeout for operations
  • Timeout exceptions: Dedicated exceptions for timeouts
  • Metrics: Recorded timeouts, average response time
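As a rough picture of a per-call timeout, the sketch below uses the JDK's `CompletableFuture.orTimeout` (Java 9+) rather than Resilience4j's TimeLimiter. The class `TimeoutSketch`, the method `callWithTimeout`, and the fallback parameter are hypothetical names for illustration. Note that `orTimeout` only completes the future exceptionally; it does not interrupt the task that is still running.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Illustrative per-call timeout built on CompletableFuture.orTimeout. */
final class TimeoutSketch {

    /** Runs the task asynchronously and returns the fallback if it exceeds timeoutMillis. */
    static <T> T callWithTimeout(Supplier<T> task, long timeoutMillis, T fallback) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(task)
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
        try {
            return future.join();
        } catch (CompletionException e) {
            // join() wraps the TimeoutException raised by orTimeout
            if (e.getCause() instanceof TimeoutException) {
                return fallback;
            }
            throw e; // any other failure is propagated unchanged
        }
    }
}
```

Distinguishing the timeout from other failures is what lets the caller apply a dedicated timeout policy (for example, counting it as a slow call) instead of treating every exception the same way.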

✅ Rate Limiting

  • Request rate limiting: Cap on requests per period
  • Thread pool limiting: Cap on concurrent threads
  • Dynamic limits: Limits adjustable at runtime
  • Metrics: Permitted requests, blocked requests, current rate
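The idea of "N requests per period" can be sketched as a fixed-window permit counter. This is a simplified stand-in, not Resilience4j's RateLimiter; the class `WindowRateLimiter` and the injected clock are hypothetical, and the parameter names merely echo the limitForPeriod and limitRefreshPeriod settings.

```java
import java.util.function.LongSupplier;

/** Illustrative fixed-window rate limiter: at most limitForPeriod permits per period. */
final class WindowRateLimiter {
    private final int limitForPeriod;
    private final long periodMillis;
    private final LongSupplier clockMillis; // injected so time can be controlled in tests

    private long windowStart;
    private int used;

    WindowRateLimiter(int limitForPeriod, long periodMillis, LongSupplier clockMillis) {
        this.limitForPeriod = limitForPeriod;
        this.periodMillis = periodMillis;
        this.clockMillis = clockMillis;
        this.windowStart = clockMillis.getAsLong();
    }

    /** Non-blocking: returns false when the permits of the current period are used up. */
    synchronized boolean tryAcquire() {
        long now = clockMillis.getAsLong();
        long elapsed = now - windowStart;
        if (elapsed >= periodMillis) {
            // jump to the start of the period that contains "now" and refill the permits
            windowStart += (elapsed / periodMillis) * periodMillis;
            used = 0;
        }
        if (used < limitForPeriod) {
            used++;
            return true;
        }
        return false;
    }
}
```

Returning `false` immediately corresponds to a limiter configured with no wait time: excess requests are rejected instead of queued.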

✅ Bulkhead Pattern

  • Thread pool isolation: Separate thread pools per service
  • Semaphore isolation: Semaphores to cap concurrency
  • Resource isolation: Resources isolated per context
  • Metrics: Active threads, waiting requests
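Semaphore isolation is the simplest of these to picture. The sketch below caps concurrent calls with a `java.util.concurrent.Semaphore` and fails fast when no permit is free. The class `SemaphoreBulkhead` and the choice of `IllegalStateException` for rejections are assumptions made for illustration, not Resilience4j's API.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Illustrative semaphore bulkhead: at most maxConcurrentCalls run at the same time. */
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the call if a permit is free; otherwise rejects immediately without waiting. */
    <T> T execute(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("Bulkhead full");
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit, even when the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Giving each downstream service its own instance means that a slow service can exhaust only its own permits, which is the isolation the pattern is named after.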

📊 Architecture

┌─────────────────────────────────────────────────────────┐
│                     Client Service                      │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│             Resilience Layer (Resilience4j)             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   Circuit    │  │    Retry     │  │   Timeout    │   │
│  │   Breaker    │  │   Pattern    │  │   Pattern    │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
│  ┌──────────────┐  ┌──────────────┐                     │
│  │ Rate Limiter │  │   Bulkhead   │                     │
│  └──────────────┘  └──────────────┘                     │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│              External Service / Database                │
└─────────────────────────────────────────────────────────┘

💻 Implementation

Dependencies (pom.xml)

<!-- Resilience4j (the starter expects spring-boot-starter-aop and spring-boot-starter-actuator at runtime) -->
<dependency>
  <groupId>io.github.resilience4j</groupId>
  <artifactId>resilience4j-spring-boot3</artifactId>
  <version>2.1.0</version>
</dependency>
<dependency>
  <groupId>io.github.resilience4j</groupId>
  <artifactId>resilience4j-micrometer</artifactId>
  <version>2.1.0</version>
</dependency>
<dependency>
  <groupId>io.github.resilience4j</groupId>
  <artifactId>resilience4j-reactor</artifactId>
  <version>2.1.0</version>
</dependency>

Configuration (application.yml)

resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        registerHealthIndicator: true
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 10s
        failureRateThreshold: 50
        slowCallRateThreshold: 100
        slowCallDurationThreshold: 2s
      notificationService:
        registerHealthIndicator: true
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        waitDurationInOpenState: 5s
        failureRateThreshold: 50
  
  retry:
    instances:
      paymentService:
        maxAttempts: 3
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        retryExceptions:
          - java.net.ConnectException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - com.example.BusinessException
  
  timelimiter:
    instances:
      paymentService:
        timeoutDuration: 3s
        cancelRunningFuture: true
  
  ratelimiter:
    instances:
      paymentService:
        limitForPeriod: 10
        limitRefreshPeriod: 1s
        timeoutDuration: 0
        subscribeForEvents: true
        registerHealthIndicator: true
  
  bulkhead:
    instances:
      paymentService:
        maxConcurrentCalls: 10
        maxWaitDuration: 0

Circuit Breaker Service

import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.ratelimiter.annotation.RateLimiter;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import java.util.concurrent.CompletableFuture;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
@Slf4j
public class PaymentService {

    private final RestTemplate restTemplate;

    // The annotations resolve their instances by name from the auto-configured
    // registries, so the service itself only needs the RestTemplate. Injecting the
    // core CircuitBreaker/Retry types here would also clash with the annotation names.
    public PaymentService(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    // Default aspect order, outermost first:
    // Retry -> CircuitBreaker -> RateLimiter -> TimeLimiter -> Bulkhead.
    // The fallback sits on @Retry, the outermost aspect, so it runs only after all
    // attempts are exhausted. A fallback on @CircuitBreaker would turn the failure
    // into a normal result before Retry could see it, and no retry would happen.
    @Retry(name = "paymentService", fallbackMethod = "processPaymentFallback")
    @CircuitBreaker(name = "paymentService")
    @RateLimiter(name = "paymentService")
    @TimeLimiter(name = "paymentService")
    @Bulkhead(name = "paymentService")
    public CompletableFuture<PaymentResponse> processPayment(PaymentRequest request) {
        log.info("Processing payment: {}", request);

        // @TimeLimiter requires an asynchronous return type such as CompletableFuture
        return CompletableFuture.supplyAsync(() -> {
            ResponseEntity<PaymentResponse> response = restTemplate.postForEntity(
                "http://payment-service/api/payments",
                request,
                PaymentResponse.class
            );
            return response.getBody();
        });
    }

    // Same parameters as the protected method, plus the exception as the last argument
    public CompletableFuture<PaymentResponse> processPaymentFallback(
            PaymentRequest request,
            Exception ex) {
        log.error("Payment service unavailable, using fallback", ex);

        // Fallback: queue the payment for later processing
        return CompletableFuture.completedFuture(
            PaymentResponse.builder()
                .status("QUEUED")
                .message("Payment queued for processing")
                .build()
        );
    }
}

Resilience Decorator (Manual)

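import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryRegistry;
import java.util.function.Supplier;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
@Slf4j
public class NotificationService {

    private final RestTemplate restTemplate;
    private final CircuitBreaker circuitBreaker;
    private final Retry retry;

    public NotificationService(
            RestTemplate restTemplate,
            CircuitBreakerRegistry circuitBreakerRegistry,
            RetryRegistry retryRegistry) {
        this.restTemplate = restTemplate;
        this.circuitBreaker = circuitBreakerRegistry.circuitBreaker("notificationService");
        this.retry = retryRegistry.retry("notificationService");
    }

    public void sendNotification(NotificationRequest request) {
        Supplier<String> call = () -> {
            log.info("Sending notification: {}", request);
            ResponseEntity<String> response = restTemplate.postForEntity(
                "http://notification-service/api/notifications",
                request,
                String.class
            );
            return response.getBody();
        };

        // Retry wraps the raw call and the circuit breaker wraps the retried call,
        // so the breaker records one outcome per notification, not one per attempt.
        Supplier<String> decoratedSupplier = CircuitBreaker.decorateSupplier(
            circuitBreaker,
            Retry.decorateSupplier(retry, call)
        );

        String result;
        try {
            result = decoratedSupplier.get();
        } catch (Exception e) {
            // Plain try/catch instead of Vavr's Try, which is not among the declared
            // dependencies. This also covers calls rejected while the breaker is OPEN.
            log.error("Notification failed, using fallback", e);
            result = "Notification queued";
        }

        log.info("Notification result: {}", result);
    }
}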
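The order in which decorators are nested decides what each layer observes. The stdlib-only sketch below makes that visible with two hypothetical helpers, `withRetry` and `withCounter` (the counter stands in for a circuit breaker recording calls); neither is part of Resilience4j.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Illustrative decorators showing how nesting order changes what each layer sees. */
final class DecoratorOrderSketch {

    /** Re-invokes the supplier until it succeeds or maxAttempts is reached. */
    static <T> Supplier<T> withRetry(Supplier<T> inner, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e; // remember the failure and try again
                }
            }
            throw last;
        };
    }

    /** Counts the calls this layer observes, as a breaker would when recording outcomes. */
    static <T> Supplier<T> withCounter(Supplier<T> inner, AtomicInteger observed) {
        return () -> {
            observed.incrementAndGet();
            return inner.get();
        };
    }
}
```

For a call that fails twice and then succeeds, `withCounter(withRetry(call, 3), n)` observes a single successful call, while `withRetry(withCounter(call, n), 3)` observes all three attempts. The same reasoning applies when choosing whether a circuit breaker should sit outside or inside a retry.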

Metrics & Observability

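import io.github.resilience4j.bulkhead.BulkheadRegistry;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.micrometer.tagged.TaggedBulkheadMetrics;
import io.github.resilience4j.micrometer.tagged.TaggedCircuitBreakerMetrics;
import io.github.resilience4j.micrometer.tagged.TaggedRateLimiterMetrics;
import io.github.resilience4j.micrometer.tagged.TaggedRetryMetrics;
import io.github.resilience4j.ratelimiter.RateLimiterRegistry;
import io.github.resilience4j.retry.RetryRegistry;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Needed only when the starter's metrics auto-configuration is not in use: with
// resilience4j-spring-boot3 and Micrometer on the classpath these binders are
// registered automatically.
@Configuration
public class ResilienceMetricsConfig {

    // The registries are injected as bean method parameters
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> resilienceMetricsCustomizer(
            CircuitBreakerRegistry circuitBreakerRegistry,
            RetryRegistry retryRegistry,
            RateLimiterRegistry rateLimiterRegistry,
            BulkheadRegistry bulkheadRegistry) {
        return registry -> {
            // Circuit Breaker Metrics
            TaggedCircuitBreakerMetrics
                .ofCircuitBreakerRegistry(circuitBreakerRegistry)
                .bindTo(registry);

            // Retry Metrics
            TaggedRetryMetrics
                .ofRetryRegistry(retryRegistry)
                .bindTo(registry);

            // Rate Limiter Metrics
            TaggedRateLimiterMetrics
                .ofRateLimiterRegistry(rateLimiterRegistry)
                .bindTo(registry);

            // Bulkhead Metrics
            TaggedBulkheadMetrics
                .ofBulkheadRegistry(bulkheadRegistry)
                .bindTo(registry);
        };
    }
}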

Health Indicators

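import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class ResilienceHealthIndicator implements HealthIndicator {

    private final CircuitBreakerRegistry circuitBreakerRegistry;

    public ResilienceHealthIndicator(CircuitBreakerRegistry circuitBreakerRegistry) {
        this.circuitBreakerRegistry = circuitBreakerRegistry;
    }

    @Override
    public Health health() {
        // new Health.Builder() defaults to status UNKNOWN; report UP explicitly and
        // expose the breaker states as details.
        Health.Builder builder = Health.up();

        // getAllCircuitBreakers() yields the breakers themselves, so the name is
        // read from each instance.
        circuitBreakerRegistry.getAllCircuitBreakers().forEach(circuitBreaker -> {
            String name = circuitBreaker.getName();
            CircuitBreaker.State state = circuitBreaker.getState();
            builder.withDetail(name + ".state", state.name());
            // The failure rate is -1 until minimumNumberOfCalls has been recorded
            builder.withDetail(name + ".failureRate",
                circuitBreaker.getMetrics().getFailureRate());
        });

        return builder.build();
    }
}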

📈 Metrics and Observability

Prometheus Metrics

# Circuit Breaker Metrics
resilience4j_circuitbreaker_calls{name="paymentService",state="closed"} 100
resilience4j_circuitbreaker_calls{name="paymentService",state="open"} 5
resilience4j_circuitbreaker_calls{name="paymentService",state="half_open"} 2
resilience4j_circuitbreaker_failure_rate{name="paymentService"} 0.15
resilience4j_circuitbreaker_slow_calls{name="paymentService"} 3

# Retry Metrics
resilience4j_retry_calls{name="paymentService",result="successful"} 95
resilience4j_retry_calls{name="paymentService",result="failed"} 5
resilience4j_retry_retries{name="paymentService"} 12

# Rate Limiter Metrics
resilience4j_ratelimiter_available_permissions{name="paymentService"} 8
resilience4j_ratelimiter_waiting_threads{name="paymentService"} 2

# Bulkhead Metrics
resilience4j_bulkhead_available_concurrent_calls{name="paymentService"} 7
resilience4j_bulkhead_threads_waiting{name="paymentService"} 3

Grafana Dashboard

  • Circuit Breaker State: State (CLOSED/OPEN/HALF_OPEN) per service
  • Failure Rate: Failure rate per service
  • Retry Attempts: Retry attempts per service
  • Rate Limiter: Permitted/blocked requests
  • Bulkhead: Active/waiting threads

🎓 Lessons Learned

✅ What worked well

  • Circuit breakers prevent cascading failures: They isolate failures before they spread
  • Retry with exponential backoff: Reduces load on overloaded services
  • Timeouts avoid hanging requests: Requests do not stay blocked indefinitely
  • Rate limiting protects services: Prevents request overload
  • Bulkheads isolate resources: Failures in one service do not affect the others

⚠️ Challenges and solutions

  • Complex initial configuration: We created reusable configuration templates
  • Metrics can be overwhelming: We show only the most important metrics on the dashboard
  • Fallback strategies: We implemented fallbacks suited to each context

📚 Resources

🔗 Links

Next episode → Day 18/30: Service Mesh & Istio (coming soon)