Break Me If You Can: Practical Guide to Building Fault-Tolerant Systems
Devoxx Belgium, November 15, 2018
Alex Borysov, Software Engineer @ Google
Mykyta Protsenko, Software Engineer @ Netflix
Who are we? Alex Borysov, Software Engineer @Google (@aiborisov). Mykyta Protsenko, Software Engineer @Netflix (@mykyta_p).
Fault-Tolerance?
Fault vs Error vs Failure
Fault: incorrect internal state. Picture by Bob McMillan. Public domain. See slide #180 for details.
Error: visibly incorrect behaviour. Picture by David Goehring. CC BY 2.0. See slide #180 for details.
Failure: main functionality is broken. Picture by Camerafiend. CC BY-SA 3.0. See slide #180 for details.
RMS Titanic vs Miracle on the Hudson. Willy Stöwer. Public domain. See slide #180 for details. By Greg Lam Pak Ng. CC BY 2.0. See slide #181 for details.
RMS Titanic. Fault: hitting an iceberg. Error: water in the hull. Failure: sinking. Willy Stöwer. Public domain. See slide #180 for details.
Miracle on the Hudson. Fault: hitting geese at 859 m. Error: engines shut down. No failure! By Greg Lam Pak Ng. CC BY 2.0. See slide #181 for details.
Fault → Error → Failure
Fault Tolerance: Code and Design Patterns; Product-Driven Decisions; Communication. By Greg Lam Pak Ng. CC BY 2.0. See slide #181 for details.
Dodging Geese
Dodging Geese Architecture: TOP-5, Geese Service, Clouds Service, Leaderboard Service, API Gateway. See slides ##180, 181 for licensing details.
Leaderboard API (REST)
  /players/<username>/score    {"name": "Jane", "score": 100}
  /leaderboard/top/<n>         [{"name": "Jane", "score": 100}, {"name": "John", "score": 50}, ...]
gRPC Service Definitions
  service GeeseService {
    // Return next line of geese.
    rpc GetGeese (GetGeeseRequest) returns (GeeseResponse);
  }
  service CloudsService {
    // Return next line of clouds.
    rpc GetClouds (GetCloudsRequest) returns (CloudsResponse);
  }
gRPC Gateway Service
  service FixtureService {
    // Return next line of geese and clouds.
    rpc GetFixture (GetFixtureRequest) returns (FixtureResponse);
  }
  Geese + Clouds = Fixture
Gateway Fixture Service
  public class FixtureService extends FixtureServiceImplBase { ... }
Gateway Fixture Service: API Gateway, Geese Service, Clouds Service
Fixture Latency = Geese Latency + Clouds Latency
Non-Blocking Calls: don’t block; send requests in parallel; combine results when ready
Gateway Service Implementation
  public class FixtureService extends FixtureServiceImplBase {
    private final GeeseServiceFutureStub geeseClient = ...;
    private final CloudsServiceFutureStub cloudsClient = ...;

    @Override
    public void getFixture(GetFixtureRequest request,
                           StreamObserver<FixtureResponse> response) {
      ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request));
      ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request));
      ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds =
          Futures.allAsList(geese, clouds);
      ...
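The "send in parallel, combine when ready" idea can be sketched with the JDK alone. This is a minimal illustration, not the talk's gateway code: `ParallelFixture` and `slowCall` are hypothetical names, and the sleeping task merely stands in for a remote call.

```java
import java.util.concurrent.CompletableFuture;

/** JDK-only sketch of sending two calls in parallel and combining the results. */
class ParallelFixture {
  /** Hypothetical stand-in for a remote call: sleeps, then returns a value. */
  static CompletableFuture<String> slowCall(String value, long delayMillis) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        Thread.sleep(delayMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
      return value;
    });
  }

  /** Starts both calls before waiting on either, then combines the results. */
  static CompletableFuture<String> getFixture(long geeseMillis, long cloudsMillis) {
    CompletableFuture<String> geese = slowCall("geese", geeseMillis);
    CompletableFuture<String> clouds = slowCall("clouds", cloudsMillis);
    return geese.thenCombine(clouds, (g, c) -> g + "+" + c);
  }
}
```

Because both calls are started before either result is awaited, the combined latency should be close to the slower of the two calls rather than their sum.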
Slow dependencies; slow upstream services
Timeouts: guaranteed latency for integration points
Gateway Service Implementation
  public class FixtureService extends FixtureServiceImplBase {
    ...
    @Override
    public void getFixture(GetFixtureRequest request,
                           StreamObserver<FixtureResponse> response) {
      ListenableFuture<GeeseResponse> geese = geeseClient
          .withDeadlineAfter(500, MILLISECONDS)
          .getGeese(toGeeseRequest(request));
      ListenableFuture<CloudsResponse> clouds = cloudsClient
          .withDeadlineAfter(500, MILLISECONDS)
          .getClouds(toCloudsRequest(request));
      ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds =
          Futures.allAsList(geese, clouds);
      ...
REST: Non-Blocking Calls with Timeout
  CompletableFuture<List<LeaderboardEntry>> leaderboard = httpClient
      .get().uri("/top/5")
      .exchange()
      .timeout(Duration.ofMillis(500))
      .flatMap(cr -> cr.bodyToMono(...))
      .toFuture();
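For readers without gRPC or WebClient at hand, a caller-side timeout can be approximated with `CompletableFuture.orTimeout` (Java 9+). This is only a sketch under that assumption; `TimeoutSketch` and its helpers are hypothetical names. Unlike a gRPC deadline, `orTimeout` bounds only how long the caller waits and does not cancel the underlying work.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** JDK-only sketch of bounding the wait on an asynchronous call. */
class TimeoutSketch {
  /** Fails the future with TimeoutException if it is not done within millis. */
  static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> call, long millis) {
    return call.orTimeout(millis, TimeUnit.MILLISECONDS);
  }

  /** Returns true if waiting on the future ended with a TimeoutException. */
  static boolean timedOut(CompletableFuture<?> call) {
    try {
      call.join();
      return false;
    } catch (CompletionException e) {
      return e.getCause() instanceof TimeoutException;
    }
  }
}
```

A future that never completes is reported as timed out after the given delay, while an already completed future passes through unchanged.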
Demo
No geese; no clouds; blinking leaderboard
Observability. Monitoring: QPS, latency, errors, ...
Observability: gRPC. Monitoring: QPS, latency, errors, ...
  // OpenCensus
  RpcViews.registerAllViews();
Tracing: gRPC
  GrpcTracing grpcTracing = GrpcTracing.create(...);
  ManagedChannelBuilder ...
      .intercept(grpcTracing.newClientInterceptor())
      .build();
  ServerBuilder.forPort(8080) ...
      .intercept(grpcTracing.newServerInterceptor())
      .build();
Tracing: REST
  build.gradle:
    dependencies {
      compile '...:spring-cloud-sleuth-zipkin'
      compile '...:spring-cloud-starter-sleuth'
      ...
    }
  application.properties:
    spring.zipkin.baseUrl=http://zipkin:9411/
    spring.sleuth.sampler.probability=1.0
    spring.sleuth.web.enabled=true
Demo
Clouds are slow; geese are fast; entire call fails
Partial Degradation
  ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request));
  ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request));
  // Before: Futures.allAsList(geese, clouds), which fails if either call fails.
  // After: successfulAsList returns null in place of each failed result.
  ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds =
      Futures.successfulAsList(geese, clouds);
  ...
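The behaviour of Guava's `Futures.successfulAsList` can be imitated with plain `CompletableFuture`, which may help when reasoning about partial degradation without Guava. This is a sketch only; `PartialDegradation` is a hypothetical class, not part of the demo code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** JDK-only sketch: collect results, substituting null for failed calls. */
class PartialDegradation {
  /**
   * Waits for all futures and returns their results in order; a failed
   * future contributes null instead of failing the whole batch.
   */
  static <T> CompletableFuture<List<T>> successfulAsList(List<CompletableFuture<T>> futures) {
    List<CompletableFuture<T>> tolerant = new ArrayList<>();
    for (CompletableFuture<T> f : futures) {
      tolerant.add(f.exceptionally(error -> null)); // swallow the failure
    }
    return CompletableFuture
        .allOf(tolerant.toArray(new CompletableFuture<?>[0]))
        .thenApply(done -> {
          List<T> results = new ArrayList<>();
          for (CompletableFuture<T> f : tolerant) {
            results.add(f.join()); // already complete, so this does not block
          }
          return results;
        });
  }
}
```

The caller then has to treat a null entry as "this part of the fixture is missing", for example by rendering geese without clouds.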
Some leaderboard calls fail; leaderboard latency is low; scores disappear
Retries: REST
  CompletableFuture<List<Leaderboard>> httpRequest() {
    return httpClient
        .get().uri("/top/5").exchange()
        .timeout(Duration.ofMillis(500))
        .flatMap(...).toFuture();
  }

  RetryPolicy RETRY_POLICY = new RetryPolicy()
      .retryOn(IOException.class)
      .withMaxRetries(MAX_RETRIES);

  CompletableFuture<List<Leaderboard>> top5 = Failsafe.with(RETRY_POLICY)
      ...
      .future(this::httpRequest);
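What the retry policy above expresses (retry only on `IOException`, at most a fixed number of times) can be written out by hand as follows. This is a simplified synchronous sketch, not how Failsafe is implemented; `RetrySketch` and `IoCall` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** JDK-only sketch of a bounded retry loop that retries only on IOException. */
class RetrySketch {
  /** A call that may fail with a retryable IOException. */
  interface IoCall<T> {
    T call() throws IOException;
  }

  /**
   * Runs the call, retrying on IOException up to maxRetries extra attempts.
   * Other exceptions propagate immediately and are not retried.
   */
  static <T> T callWithRetries(IoCall<T> call, int maxRetries) {
    if (maxRetries < 0) {
      throw new IllegalArgumentException("maxRetries must be >= 0");
    }
    IOException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return call.call();
      } catch (IOException e) {
        last = e; // retryable: loop again if attempts remain
      }
    }
    throw new UncheckedIOException("retries exhausted", last);
  }
}
```

A call that fails twice and then succeeds returns normally with `maxRetries` of 3; a call that always fails surfaces an `UncheckedIOException` after the last attempt.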
Demo
Retry slow calls? Retry failed calls? Retry network faults?
Retry Storm: API Gateway, Clouds Service
Exponential Backoffs
  new RetryPolicy()
      .withBackoff(
          MIN_DELAY, MAX_DELAY,
          TimeUnit.MILLISECONDS,
          100.0)
      ...
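The delay schedule that an exponential backoff produces can be computed directly. The sketch below assumes the usual form (start at a minimum delay, multiply by a factor on each retry, cap at a maximum); the method name and the factor of 2.0 used in the usage note are illustrative choices, not values from the talk.

```java
/** Sketch of an exponential backoff schedule with a cap. */
class BackoffSketch {
  /**
   * Delay before retry number `retry` (1-based):
   * minDelay * factor^(retry - 1), capped at maxDelay.
   */
  static long delayMillis(int retry, long minDelay, long maxDelay, double factor) {
    double delay = minDelay * Math.pow(factor, retry - 1);
    return (long) Math.min(delay, (double) maxDelay);
  }
}
```

With a minimum of 100 ms, a maximum of 5000 ms and a factor of 2.0, the first three retries wait 100, 200 and 400 ms, and from the seventh retry on the delay stays at the 5000 ms cap.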
Fallbacks
  Failsafe
      .with(RETRY_POLICY)
      .withFallback(() -> emptyLeaderboard())
      ...
Fallbacks
  Failsafe
      .with(RETRY_POLICY)
      .withFallback(() -> cachedLeaderboard())
      ...
On Error: Retry; Fallback; Fail Fast
High 99th-percentile latency; 100 requests. Error probability: 1 - 0.99^100 ≈ 63%
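The 63% figure follows from treating the 100 requests as independent, each with a 1% chance of landing in the slow tail. A small helper makes the arithmetic reproducible; `TailLatency` is a hypothetical name and independence is an assumption of the model.

```java
/** Sketch of the tail-latency arithmetic, assuming independent requests. */
class TailLatency {
  /**
   * Probability that at least one of n independent requests is slow,
   * given a per-request slow probability p: 1 - (1 - p)^n.
   */
  static double atLeastOneSlow(double p, int n) {
    return 1.0 - Math.pow(1.0 - p, n);
  }
}
```

For p = 0.01 and n = 100 the result is about 0.634, which is the 63% shown above.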
Tail-Tolerance: Request (200 ms deadline) ↓ after 100 ms: second Request → Fastest Response
Request Hedging: high 99th-percentile latency; 100 requests. Error probability: 63% x 0.01 < 1%
Hedging in gRPC (soon)
  Channel geeseChannel = ManagedChannelBuilder
      .forAddress(geeseHost, geesePort)
      .enableRetry()
      .maxHedgedAttempts(MAX_HEDGES)
      .build();
  GeeseServiceFutureStub geeseStub = GeeseServiceGrpc
      .newFutureStub(geeseChannel);
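The hedging idea (send a second attempt if the first has not answered after a delay, then take whichever succeeds first) can be sketched with a scheduler and `CompletableFuture`. This is an illustration under simplifying assumptions, not gRPC's implementation: `HedgingSketch` is a hypothetical name, only one hedged attempt is sent, and the losing attempt is not cancelled.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** JDK-only sketch of request hedging with a single hedged attempt. */
class HedgingSketch {
  /**
   * Starts one attempt immediately and a second one after hedgeDelayMillis
   * if no result has arrived yet. Completes with the first success and
   * fails only when both attempts have failed.
   */
  static <T> CompletableFuture<T> hedged(Supplier<CompletableFuture<T>> attempt,
                                         long hedgeDelayMillis,
                                         ScheduledExecutorService scheduler) {
    CompletableFuture<T> result = new CompletableFuture<>();
    AtomicInteger failures = new AtomicInteger();
    launch(attempt, result, failures);
    ScheduledFuture<?> hedge = scheduler.schedule(() -> {
      if (!result.isDone()) {
        launch(attempt, result, failures);
      }
    }, hedgeDelayMillis, TimeUnit.MILLISECONDS);
    // Once a result is in, a pending hedge is no longer needed.
    result.whenComplete((value, error) -> hedge.cancel(false));
    return result;
  }

  private static <T> void launch(Supplier<CompletableFuture<T>> attempt,
                                 CompletableFuture<T> result,
                                 AtomicInteger failures) {
    attempt.get().whenComplete((value, error) -> {
      if (error == null) {
        result.complete(value); // first success wins; later ones are ignored
      } else if (failures.incrementAndGet() == 2) {
        result.completeExceptionally(error); // both attempts failed
      }
    });
  }
}
```

If the two attempts are assumed to be independent, a request is slow only when both are, which happens with probability 0.01 x 0.01 = 0.0001; across 100 requests that gives 1 - 0.9999^100, just under 1%. Hedging is only safe for calls that can be executed twice without harm.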
High mean latency; 100 requests. Error probability: 1 - 0.50^100 = 99.99...%
Circuit Breaker
  CircuitBreaker CIRCUIT_BREAKER = new CircuitBreaker()
      .withFailureThreshold(3, 5);

  CompletableFuture<...> top5 = Failsafe
      .with(CIRCUIT_BREAKER)
      .with(RETRY_POLICY)
      ...
      .future(this::httpRequest);
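To make the state machine behind a circuit breaker concrete, here is a hand-rolled sketch. It opens when at least `failureThreshold` of the last `window` executions failed (mirroring the 3-of-5 threshold above), fails fast while open, and allows a trial call after `openMillis`. All names are hypothetical, the clock is injected for testability, and concurrent trial calls are not limited, so this should not be read as Failsafe's implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Simplified circuit breaker sketch: closed, open, and a trial call after a delay. */
class CircuitBreakerSketch {
  private final int failureThreshold;  // e.g. 3 failures ...
  private final int window;            // ... out of the last 5 executions
  private final long openMillis;       // how long to stay open before a trial call
  private final LongSupplier clock;    // injected time source (milliseconds)
  private final Deque<Boolean> recent = new ArrayDeque<>(); // true = failure
  private long openedAt = -1;          // -1 means the breaker is closed

  CircuitBreakerSketch(int failureThreshold, int window, long openMillis, LongSupplier clock) {
    this.failureThreshold = failureThreshold;
    this.window = window;
    this.openMillis = openMillis;
    this.clock = clock;
  }

  /** True when closed, or when open long enough to allow a trial call. */
  synchronized boolean allowsCall() {
    return openedAt < 0 || clock.getAsLong() - openedAt >= openMillis;
  }

  private synchronized void record(boolean failed) {
    if (openedAt >= 0) {
      // Outcome of a trial call: success closes, failure re-opens.
      if (failed) {
        openedAt = clock.getAsLong();
      } else {
        openedAt = -1;
        recent.clear();
      }
      return;
    }
    recent.addLast(failed);
    if (recent.size() > window) {
      recent.removeFirst();
    }
    long failures = recent.stream().filter(f -> f).count();
    if (failures >= failureThreshold) {
      openedAt = clock.getAsLong();
    }
  }

  /** Runs the action unless the breaker is open; uses the fallback on rejection or failure. */
  <T> T call(Supplier<T> action, Supplier<T> fallback) {
    if (!allowsCall()) {
      return fallback.get(); // fail fast without touching the dependency
    }
    try {
      T value = action.get();
      record(false);
      return value;
    } catch (RuntimeException e) {
      record(true);
      return fallback.get();
    }
  }
}
```

After three failed calls the breaker rejects further calls without invoking the action, and a successful trial call after the open period closes it again.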
Error Handling
  100% Error: Fail Fast
  Intermittent, Slow: Hedging
  Intermittent, Fast: Retry
  + Fallback
Client-driven deadline; don’t process failed calls
Deadlines (API Gateway): Deadline 200 ms → Spent 120 ms → Spent 90 ms → X (210 ms in total exceeds the 200 ms deadline). See slides ##180, 181 for licensing details.
Deadline Propagation (API Gateway): Deadline 200 ms → Spent 120 ms → Deadline 80 ms → Spent 90 ms → Deadline -10 ms → X
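The bookkeeping behind deadline propagation is small: each hop passes on the remaining budget instead of starting a fresh timeout. The sketch below reproduces the numbers from the diagram (200 ms, 120 ms spent, 90 ms spent); `DeadlineSketch` is a hypothetical class and time is passed in explicitly to keep it deterministic.

```java
/** Sketch of an absolute deadline whose remaining budget is handed downstream. */
class DeadlineSketch {
  private final long expiresAtMillis;

  private DeadlineSketch(long expiresAtMillis) {
    this.expiresAtMillis = expiresAtMillis;
  }

  /** Creates a deadline that expires budgetMillis after nowMillis. */
  static DeadlineSketch after(long budgetMillis, long nowMillis) {
    return new DeadlineSketch(nowMillis + budgetMillis);
  }

  /** Budget left at nowMillis; this is what a downstream call should receive. */
  long remainingMillis(long nowMillis) {
    return expiresAtMillis - nowMillis;
  }

  /** True once the budget is used up, so further work can be skipped. */
  boolean isExpired(long nowMillis) {
    return remainingMillis(nowMillis) <= 0;
  }
}
```

Starting from a 200 ms deadline, 80 ms remain after 120 ms of work; once a further 90 ms have been spent the remaining budget is -10 ms and the call can be abandoned rather than processed.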
Throughput has limits. Exceeding limits?
REST
  new ConcurrencyLimitServletFilter(
      new ServletLimiterBuilder()
          .partitionByHeader("GEESE_TYPE", c -> c
              .assign("premium", 0.9)
              .assign("free", 0.1))
          .limiter(l -> l.limit(
              newBuilder()
                  .initialLimit(1000)...);
gRPC: Server
  var limiter = new GrpcServerLimiterBuilder()
      .partitionByHeader(GEESE_TYPE)
      .partition("premium", 0.9)
      .partition("free", 0.1)
      .limiter(l -> l.limit(
          newBuilder()
              .initialLimit(1000)...);
  ConcurrencyLimitServerInterceptor
      .newBuilder(limiter).build();
gRPC: Client
  new GrpcClientLimiterBuilder()
      .limit(
          newBuilder()
              .initialLimit(1000).build())
      .blockOnLimit(false) // fail-fast
      .build();
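The fail-fast behaviour of a concurrency limit can be illustrated with a plain `Semaphore`. This sketch uses a fixed limit, whereas the Netflix library adjusts its limit at runtime; `ConcurrencyLimitSketch` is a hypothetical name.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Sketch of a fixed concurrency limit that rejects instead of queueing. */
class ConcurrencyLimitSketch {
  private final Semaphore permits;

  ConcurrencyLimitSketch(int limit) {
    this.permits = new Semaphore(limit);
  }

  /**
   * Runs the action if a slot is free; otherwise returns empty immediately.
   * The action must not return null.
   */
  <T> Optional<T> tryCall(Supplier<T> action) {
    if (!permits.tryAcquire()) {
      return Optional.empty(); // over the limit: shed load
    }
    try {
      return Optional.of(action.get());
    } finally {
      permits.release(); // free the slot even if the action throws
    }
  }
}
```

With a limit of 1, a call attempted while another is still in flight is rejected, and the slot becomes available again as soon as the first call finishes.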
Demo
Monitoring: APM; service metrics; distributed tracing; business metrics. Picture by Alex Borysov. CC BY 2.0. See slide #180 for details.
Code and Design: Timeouts / Deadline Propagation; Retries / Hedging; Proper Fallbacks; Concurrency Limits; Load Shedding; Observability
Request for each response; requests don’t change
Redundant Requests: GeeseRequest → GeeseResponse; GeeseRequest → GeeseResponse; GeeseRequest → GeeseResponse
Streaming: GeeseRequest → GeeseResponse, GeeseResponse, GeeseResponse
gRPC Streaming
  service GeeseService {
    rpc GetGeese (GetGeeseRequest) returns (stream GeeseResponse);
  }
  service CloudsService {
    rpc GetClouds (GetCloudsRequest) returns (stream CloudsResponse);
  }
Server faster than client; client cannot keep up
Too Many Streaming Responses: GeeseRequest → X
Flow Control: GeeseRequest; 5; 3
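Flow control of this kind is usually demand-based: the consumer says how many messages it is ready for, and the producer may send only that many. The sketch below assumes that the numbers 5 and 3 in the diagram are such demand signals, which is an interpretation rather than something the slides state; `FlowControlSketch` is a hypothetical, single-threaded model.

```java
import java.util.ArrayList;
import java.util.List;

/** Single-threaded sketch of demand-based flow control. */
class FlowControlSketch {
  private long demand = 0; // messages requested but not yet received
  private final List<String> received = new ArrayList<>();

  /** Consumer side: signal readiness for n more messages. */
  void request(long n) {
    if (n <= 0) {
      throw new IllegalArgumentException("n must be positive");
    }
    demand += n;
  }

  /** Producer side: check before sending. */
  boolean isReady() {
    return demand > 0;
  }

  /** Producer side: delivers the message only if there is outstanding demand. */
  boolean send(String message) {
    if (demand == 0) {
      return false; // producer has to wait for the next request(n)
    }
    demand--;
    received.add(message);
    return true;
  }

  int receivedCount() {
    return received.size();
  }
}
```

If the consumer requests 5 messages and the producer tries to send 10, only 5 are accepted; a further request for 3 lets 3 more through.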
Decouple producer and consumer; decouple failures
Reactive Systems: Message-driven; Elastic; Responsive; Resilient
Per instance limits
Door Capacity: Why didn’t Rose make room for Jack on the door? “The answer is very simple because it says on page 147 that Jack dies.” James Cameron. Willy Stöwer. Public domain. See slide #180 for details.
Capacity
Autoscaling
Prescaling. See slides ##180, 182 for licensing details.
Services break each other
Free and Premium?
Free and Premium Outage
Bulkheads: Free; Premium
Bulkheads: By Request Type; By Client Priority; By Region; By Availability Zone; etc.
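A bulkhead keeps one class of traffic from exhausting capacity needed by another. The sketch below gives each partition its own fixed pool of slots; the partition names and shares echo the "premium" and "free" split used in the limiter configuration earlier, while `BulkheadSketch` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

/** Sketch of bulkheads: a separate slot pool per partition. */
class BulkheadSketch {
  private final Map<String, Semaphore> pools = new HashMap<>();

  /** Reserves a fixed share of totalLimit for one partition. */
  BulkheadSketch partition(String name, double share, int totalLimit) {
    pools.put(name, new Semaphore((int) Math.round(totalLimit * share)));
    return this;
  }

  /** Takes a slot in the given partition; false if it is full or unknown. */
  boolean tryEnter(String partition) {
    Semaphore pool = pools.get(partition);
    return pool != null && pool.tryAcquire();
  }

  /** Returns a slot previously taken with tryEnter. */
  void leave(String partition) {
    pools.get(partition).release();
  }
}
```

With a total of 10 slots split 0.9 to 0.1, a second concurrent "free" request is rejected while "premium" requests are still admitted, so a flood of free traffic cannot cause a premium outage.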
Demo
Bad user experience; metrics are not enough
Prober: TOP-5, API Gateway. See slides ##180, 182 for licensing details.
Prober: Availability; Latency SLO; Response verification. CloudProber.org
Technical solutions are not enough
Communication
Communication Channels: GEESE at 270
Postmortems: Blameless; Constructive; Social. See slides ##180, 182, 183 for licensing details.
Postmortems: Timeline; Causes; Remedies
Learn from Failure: Blameless postmortems; Alert playbooks; Incident knowledge base
Libraries and Tools
  Demo: github.com/break-me-if-you-can
  Failsafe: github.com/jhalterman/failsafe
  Observability: opencensus.io, opentracing.io
  Prober: cloudprober.org
  Concurrency Limits: github.com/Netflix/concurrency-limits
Demo UI: Yevgen Golubenko. Twitter: @HalloGene_. github.com/HalloGene. Picture by Yevgen Golubenko. Also see slide #183 for licensing details.
Books
Fault-Tolerance: Code & Design Patterns; Product decisions; Communication culture
Please Break Me! If you can. Rate us if you enjoyed the talk, or give feedback. 5 STARS!
Images and Licensing
Images of geese, clouds, pilots, plane, arrows, cup, airport traffic control tower are property of Mykyta Protsenko and Alex Borysov, if not stated otherwise (see below). All Rights Reserved.
Other images used:
Slide #5: commons.wikimedia.org/wiki/File:FEMA_-_16381_-_Photograph_by_Bob_McMillan_taken_on_09-28-2005_in_Texas.jpg - Picture by Bob McMillan, the US federal government work, public domain
Slide #6: www.flickr.com/photos/carbonnyc/3290528875 - Picture by David Goehring. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0 - changes were made
Slide #7: www.flickr.com/photos/carbonnyc/3290528875 - Picture by Camerafiend. Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0): creativecommons.org/licenses/by-sa/3.0/deed.en - no changes were made
Slides ##8, 9, 134, 135: commons.wikimedia.org/wiki/File:Titanic_sinking,_painting_by_Willy_St%C3%B6wer.jpg - Willy Stöwer. Public domain work of art
Slides ##8, 10, 13: www.flickr.com/photos/22608787@N00/3200086900 - Picture by Greg Lam Pak Ng. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0 - no changes were made
Slides ##15-22, 29-33, 67, 76-79, 93-101, 115-117, 122-128, 136-140, 143-146, 151-152: Blue Game Boy Color by kure: piq.codeus.net/picture/31994/Blue-Game-Boy-Color - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
Slides ##93-101: The Sun by Vinicius615: piq.codeus.net/picture/191706/The-Sun - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
Slide #112: Picture by Alex Borysov. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0
Slide #140: piq.codeus.net/picture/254492/CVsantahat - Santa hat for CommanderVideo, CVsantahat by anonymous - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - no changes were made
Slide #152: piq.codeus.net/picture/423109/UFO - UFO by anonymous - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - no changes were made
Slides ##166, 167: piq.codeus.net/picture/334023/beer - beer by Investa - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
Slides ##166, 167: piq.codeus.net/picture/444498/Beer-Bottle - Beer Bottle by jacklrj - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made
Slide #171: https://piq.codeus.net/picture/330338/Deal-With-It - Deal With It by Shiro - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made

Break me if you can: practical guide to building fault-tolerant systems (with examples from REST and gRPC polyglot stacks)