How to debug slow Lambda response times

Yan Cui (@theburningmonk), Developer Advocate at Lumigo, AWS Serverless Hero, author of Production-Ready Serverless

MyApiFunction Worker Worker …

overloaded servers are a thing of the past

observation: the majority of performance problems originate from a function’s integration points

macro: how well is this service performing in general? (identify systemic issues)
micro: why did this transaction perform poorly? why did this user get a bad experience?

In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. what do we need to collect?

Yan Cui http://theburningmonk.com @theburningmonk Developer Advocate @ Lumigo, AWS user since 2009

Yan Cui http://theburningmonk.com @theburningmonk Independent Consultant: advise, training, delivery

API Gateway -> Lambda -> DynamoDB: how long did this request take?

what is the state of the world?

In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. what are the most important outputs to collect?

macro: how well is this service performing in general? micro: why did this transaction perform poorly?

Service A and Service B, built from API Gateway, Lambda and DynamoDB: how long did service B take to respond? was DynamoDB slow? was it a cold start? could it be API Gateway?

Lambda: time to create and initialize the worker instance

for API functions, use API Gateway’s IntegrationLatency as a proxy for “total response time from Lambda”
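As a rough sketch of how that metric could be pulled, the helper below builds the parameters for a CloudWatch GetMetricStatistics request against the AWS/ApiGateway namespace. The API name, the stage, the one-hour window and the choice of p99 are placeholders for illustration, and the function only builds the request; sending it with an AWS SDK client is left out.

```javascript
// Build GetMetricStatistics parameters for API Gateway's IntegrationLatency.
// `apiName` and `stage` are hypothetical values; substitute your own.
function integrationLatencyQuery(apiName, stage, now = new Date()) {
  const oneHourMs = 60 * 60 * 1000;
  return {
    Namespace: 'AWS/ApiGateway',
    MetricName: 'IntegrationLatency',
    Dimensions: [
      { Name: 'ApiName', Value: apiName },
      { Name: 'Stage', Value: stage },
    ],
    StartTime: new Date(now.getTime() - oneHourMs), // look back one hour
    EndTime: now,
    Period: 60, // seconds per datapoint
    ExtendedStatistics: ['p99'], // look at the tail, not just the average
  };
}

const params = integrationLatencyQuery('my-api', 'prod');
console.log(JSON.stringify(params, null, 2));
```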

DynamoDB SuccessfulRequestLatency

“I'm facing this problem now with a lambda that usually takes 25 ms but once a week or so takes > 6000 ms and times out. The lambda's first step is to load a DynamoDB table that only has 8 items. I'm at a loss to understand how such a simple query could take so long.”

START, 1st attempt, exponential backoff (1), 2nd attempt, exponential backoff (2), 3rd attempt, exponential backoff (3), 4th attempt: success! SuccessfulRequestLatency covers only the successful attempt.
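To make the gap concrete, here is a small arithmetic sketch of one call as the caller experiences it. All the numbers are invented for illustration; the point is only that failed attempts and backoff pauses add up on the caller's side while the service-side metric sees a single fast request.

```javascript
// Model one DynamoDB call as seen by the caller: every attempt plus every
// backoff pause counts towards the time the caller waits.
// All numbers are invented for illustration.
function callerSideLatency(attemptMs, backoffMs) {
  const sum = (xs) => xs.reduce((a, b) => a + b, 0);
  return sum(attemptMs) + sum(backoffMs);
}

const attemptMs = [10, 10, 10, 10]; // three failed attempts, then success
const backoffMs = [40, 90, 180];    // pauses between the attempts

const total = callerSideLatency(attemptMs, backoffMs);
const lastAttempt = attemptMs[attemptMs.length - 1];

console.log(`caller waited ${total}ms, successful attempt took ${lastAttempt}ms`);
```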

JavaScript AWS SDK: 10 retries, initial exponential backoff of 50ms
delay = Math.random() * (Math.pow(2, retryCount) * base)
(this is Marc Brooker’s fav formula!)
danger zone!
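A quick way to see why the later retries are dangerous is to evaluate the formula for each retry. The sketch below assumes `base` is 50ms and that `retryCount` starts at 0 for the first retry; both are assumptions for illustration rather than a statement of exactly how the SDK counts.

```javascript
// Upper bound of the jittered delay before a retry, i.e. the slide's
// formula with Math.random() at its maximum.
// Assumption: retryCount is 0 for the first retry and base is 50ms.
function maxDelayMs(retryCount, base = 50) {
  return Math.pow(2, retryCount) * base;
}

// The actual delay, as on the slide: a random value in [0, maxDelayMs).
function jitteredDelayMs(retryCount, base = 50, random = Math.random) {
  return random() * maxDelayMs(retryCount, base);
}

const retries = 10;
let worstCaseTotal = 0;
for (let retryCount = 0; retryCount < retries; retryCount++) {
  worstCaseTotal += maxDelayMs(retryCount);
  console.log(`retry ${retryCount + 1}: up to ${maxDelayMs(retryCount)}ms`);
}
console.log(`worst-case total backoff: ${worstCaseTotal}ms`);
```

Under those assumptions the last few retries can each wait several seconds, which is enough on its own to exceed a 6000 ms timeout like the one in the quote above.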

Record client-side latency metrics for IO operations
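One possible way to do this is to wrap each IO call in a timer and write the result to stdout in CloudWatch Embedded Metric Format. This is a minimal sketch: the namespace `MyApp/ClientSide`, the `Operation` dimension and the helper names are made up for the example, and the 20ms timer stands in for a real SDK call.

```javascript
// Format one latency datapoint as a CloudWatch Embedded Metric Format line.
// The namespace and dimension names are hypothetical.
function toMetricLog(operation, latencyMs, timestamp = Date.now()) {
  return JSON.stringify({
    _aws: {
      Timestamp: timestamp,
      CloudWatchMetrics: [{
        Namespace: 'MyApp/ClientSide',
        Dimensions: [['Operation']],
        Metrics: [{ Name: 'Latency', Unit: 'Milliseconds' }],
      }],
    },
    Operation: operation,
    Latency: latencyMs,
  });
}

// Time an async IO call from the caller's side, retries and all,
// and emit the metric whether the call succeeds or throws.
async function withLatencyMetric(operation, fn, log = console.log) {
  const start = process.hrtime.bigint();
  try {
    return await fn();
  } finally {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    log(toMetricLog(operation, elapsedMs));
  }
}

// Usage with a stand-in for a real SDK call:
withLatencyMetric('DynamoDB.scan', () => new Promise((resolve) => setTimeout(resolve, 20)));
```

Because the timer wraps the whole call, the recorded value includes any retries and backoff that happen inside the SDK.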

www.youtube.com/watch?v=adtCwnKApWI

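Latency layers, outermost to innermost:
Latency [API Gateway]
IntegrationLatency [API Gateway] (the gap up to Latency is API Gateway’s latency overhead)
Duration [Lambda] (the gap up to IntegrationLatency is Lambda’s allocation time)
Caller-side DynamoDB latency [custom metric]
SuccessfulRequestLatency [DynamoDB] (the gap up to caller-side latency is, mostly, caller-side retries)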

[Chart: latency (ms) over time for Latency, IntegrationLatency, Duration, caller-side DynamoDB latency and SuccessfulRequestLatency]
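The gaps between those layers can be estimated by simple subtraction. The sketch below uses invented readings for a single slow request; in practice each metric is aggregated separately by its own service, so treat the results as approximations.

```javascript
// Derive the gaps between latency layers from one set of readings (ms).
// The field names and sample numbers are invented for illustration.
function latencyBreakdown(m) {
  return {
    apiGatewayOverhead: m.latency - m.integrationLatency,
    lambdaAllocation: m.integrationLatency - m.duration,
    callerSideRetries: m.callerSideDynamoDb - m.successfulRequestLatency,
  };
}

const sample = {
  latency: 120,                // Latency [API Gateway]
  integrationLatency: 105,     // IntegrationLatency [API Gateway]
  duration: 95,                // Duration [Lambda]
  callerSideDynamoDb: 70,      // custom metric recorded in the function
  successfulRequestLatency: 8, // SuccessfulRequestLatency [DynamoDB]
};

console.log(latencyBreakdown(sample));
```

A large `callerSideRetries` value next to a small SuccessfulRequestLatency points at retries and backoff in the caller rather than at DynamoDB itself.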

how well is this service performing in general? macro

why did this transaction perform poorly? micro

X-Ray can be encapsulated in custom modules

X-Ray
pros: doesn’t add latency; can see “system” overhead (e.g. allocation time); built-in sampling
cons: X-Ray SDK adds significant overhead; doesn’t trace TCP traffic (RDS/ElastiCache); poor support for async event sources (only SNS); doesn’t capture request & response data; logs and traces are separate; difficult to search

X-Ray is good enough for simple workloads; when you outgrow X-Ray, look for a 3rd-party tool

answer both macro and micro level questions in just a few clicks!

Support async event sources such as Kinesis, DynamoDB streams and SNS

Support TCP traffic, e.g. RDS and ElastiCache

trace 500K invocations per month for FREE with promo code Yan500 platform.lumigo.io/signup

How to mitigate slow dependencies?

if not, a good caching strategy often helps
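As one illustration of what a caching strategy might look like, here is a minimal in-memory cache with a time-to-live. It is a sketch under assumptions: the TTL, the key and the loader are placeholders, and in Lambda the cached data would only live as long as the worker instance that holds it.

```javascript
// A tiny in-memory TTL cache. `loader` is a hypothetical function that
// fetches the value from the slow dependency; `now` is injectable for tests.
function createTtlCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return {
    get(key, loader) {
      const hit = entries.get(key);
      if (hit && hit.expiresAt > now()) return hit.value; // still fresh
      const value = loader(key); // miss or expired: go to the dependency
      entries.set(key, { value, expiresAt: now() + ttlMs });
      return value;
    },
  };
}

const cache = createTtlCache(30000); // keep entries for 30 seconds
let loads = 0;
const load = () => { loads += 1; return ['item-1', 'item-2']; };

cache.get('config', load);
cache.get('config', load); // served from memory, loader is not called again
console.log(`loader ran ${loads} time(s)`);
```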

Client Server 1 Server 2 50ms later

tuning required for each service

helps in some cases but can exacerbate the problem in others

@theburningmonk theburningmonk.com github.com/theburningmonk yan@lumigo.io
