How to debug slow Lambda response times Yan Cui @theburningmonk Developer Advocate, Lumigo AWS Serverless Hero Author of Production-Ready Serverless
Lambda autoscales by traffic
multi-AZ by default
MyApiFunction Worker Worker …
overloaded servers are a thing of the past
observation: the majority of performance problems originate from a function’s integration points
macro: how well is this service performing in general? (identify systemic issues)
micro: why did this transaction perform poorly? (why did this user get a bad experience?)
In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. what do we need to collect?
Yan Cui http://theburningmonk.com @theburningmonk Developer Advocate @ Lumigo, AWS user since 2009
Yan Cui http://theburningmonk.com @theburningmonk Independent Consultant: advise, training, delivery
API Gateway, Lambda, DynamoDB: how long did this request take?
what is the state of the world?
In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. what are the most important outputs to collect?
macro: how well is this service performing in general? micro: why did this transaction perform poorly?
Diagram labels: API Gateway (x2), Lambda (x2), DynamoDB, Service A, Service B
how long did Service B take to respond? was DynamoDB slow? was it a cold start? could it be API Gateway?
API Gateway
Lambda
Lambda Duration
Lambda time to create and initialize the worker instance
Lambda bit.ly/2QXNVwc
bit.ly/2WL1uj0 Lambda
for API functions, use API Gateway’s IntegrationLatency as a proxy for “total response time from Lambda”
DynamoDB
DynamoDB SuccessfulRequestLatency
“I'm facing this problem now with a lambda that usually takes 25 ms but once a week or so takes > 6000 ms and times out.  The lambda's first step is to load a DynamoDB table that only has 8 items.  I'm at a loss to understand how such a simple query could take so long.”
START, 1st attempt, exponential backoff (1), 2nd attempt, exponential backoff (2), 3rd attempt, exponential backoff (3), 4th attempt: success! (SuccessfulRequestLatency)
JavaScript AWS SDK: 10 retries, initial exponential backoff of 50ms, delay = Math.random() * (Math.pow(2, retryCount) * base) (this is Marc Brooker’s fav formula!) danger zone!
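To see why the later retries are the danger zone, the formula on the slide can be evaluated for each retry. This is a minimal sketch, not the SDK's own code: `backoffDelay` and `worstCaseTotalDelay` are hypothetical helper names, and it assumes `retryCount` runs from 0 to 9 for the 10 retries.

```javascript
// Jittered exponential backoff as quoted on the slide:
//   delay = Math.random() * (Math.pow(2, retryCount) * base)
// `random` is injectable so the upper bound (random close to 1) can be computed.
function backoffDelay(retryCount, base, random = Math.random) {
  return random() * (Math.pow(2, retryCount) * base);
}

// Upper bound on the total time spent waiting across all retries.
// Assumption: retryCount runs from 0 to maxRetries - 1.
function worstCaseTotalDelay(maxRetries, base) {
  let total = 0;
  for (let retryCount = 0; retryCount < maxRetries; retryCount++) {
    total += backoffDelay(retryCount, base, () => 1);
  }
  return total;
}

const base = 50; // ms, from the slide
const maxRetries = 10; // from the slide

for (let i = 0; i < maxRetries; i++) {
  console.log(`retry ${i + 1}: up to ${backoffDelay(i, base, () => 1)} ms`);
}
console.log(`upper bound on total backoff: ${worstCaseTotalDelay(maxRetries, base)} ms`);
```

The bound doubles with every retry, so a handful of failed attempts is enough to push a call that normally takes milliseconds past a function timeout of a few seconds.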
Record client-side latency metrics for IO operations
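One way to record client-side latency is to wrap each IO call in a timer. The sketch below is an assumption about how that could look, not code from the talk: `timed`, `report` and `now` are hypothetical names, and `fakeQuery` is a stand-in rather than a real SDK call.

```javascript
// Wraps an async IO call and reports how long the caller waited for it,
// including any retries and backoff that happen inside `fn`.
// `report` and `now` are hypothetical hooks, injected to keep the sketch testable.
async function timed(metricName, fn, report = console.log, now = Date.now) {
  const start = now();
  try {
    return await fn();
  } finally {
    // Runs on success and on failure, so slow errors are recorded too.
    report({ metricName, latencyMs: now() - start });
  }
}

// Usage with a stand-in for a DynamoDB query (not a real SDK call).
const fakeQuery = () =>
  new Promise((resolve) => setTimeout(() => resolve({ Items: [] }), 20));

timed('DynamoDB.query.latency', fakeQuery).then((result) => {
  console.log('items returned:', result.Items.length);
});
```

Because the timer sits around the whole call, it captures what the function actually experienced rather than what the service measured for one successful attempt.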
www.youtube.com/watch?v=adtCwnKApWI
Embedded Metric Format (EMF)
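With EMF, a metric is published by printing a structured JSON log line, so no extra API call is made during the invocation. A minimal sketch of such a record follows; `buildEmfRecord`, the namespace, the metric name and the dimension are all made up for illustration.

```javascript
// Builds one Embedded Metric Format record. Printing it as a single JSON line
// from a Lambda function lets CloudWatch extract the metric from the log.
function buildEmfRecord({ namespace, metricName, value, unit, dimensions, timestamp = Date.now() }) {
  return {
    _aws: {
      Timestamp: timestamp, // milliseconds since epoch
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [Object.keys(dimensions)], // a single dimension set
          Metrics: [{ Name: metricName, Unit: unit }],
        },
      ],
    },
    ...dimensions, // dimension values live at the top level of the record
    [metricName]: value, // and so does the metric value
  };
}

// Hypothetical names, for illustration only.
const record = buildEmfRecord({
  namespace: 'my-service',
  metricName: 'DynamoDBLatency',
  value: 42,
  unit: 'Milliseconds',
  dimensions: { FunctionName: 'MyApiFunction' },
});
console.log(JSON.stringify(record));
```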
Latency [API Gateway]
IntegrationLatency [API Gateway] (the gap to Latency is API Gateway’s latency overhead)
Duration [Lambda] (the gap to IntegrationLatency is Lambda’s allocation time)
Caller-side DynamoDB latency [custom metric]
SuccessfulRequestLatency [DynamoDB] (the gap to caller-side DynamoDB latency is (mostly) caller-side retries)
Chart: latency (ms) over time for Latency, IntegrationLatency, Duration, caller-side DynamoDB latency and SuccessfulRequestLatency
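The gaps between these nested metrics can be turned into numbers with simple subtraction. The sketch below assumes all five values were sampled for the same time window; `breakDownLatency` is a hypothetical helper and the sample figures are invented, not taken from the chart.

```javascript
// Splits end-to-end latency into the layers named on the slides.
// All inputs are in milliseconds and should cover the same time window.
function breakDownLatency(m) {
  return {
    // Latency minus IntegrationLatency
    apiGatewayOverhead: m.latency - m.integrationLatency,
    // IntegrationLatency minus Duration
    lambdaAllocationTime: m.integrationLatency - m.duration,
    // Caller-side DynamoDB latency minus SuccessfulRequestLatency
    callerSideRetries: m.callerSideDynamoDbLatency - m.successfulRequestLatency,
  };
}

// Invented sample values, for illustration only.
console.log(
  breakDownLatency({
    latency: 120,
    integrationLatency: 110,
    duration: 80,
    callerSideDynamoDbLatency: 60,
    successfulRequestLatency: 8,
  })
);
```

When the inputs are averages or percentiles rather than values for a single request, the differences are only a rough guide to where the time goes.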
macro: how well is this service performing in general?
micro: why did this transaction perform poorly?
X-Ray
X-Ray can be encapsulated in custom modules
X-Ray: doesn’t add latency; can see “system” overhead (e.g. allocation time); built-in sampling
X-Ray: X-Ray SDK adds significant overhead; doesn’t trace TCP traffic (RDS/ElastiCache); poor support for async event sources (only SNS); doesn’t capture request & response data; logs and traces are separate; difficult to search
X-Ray: good enough for simple workloads; when you outgrow X-Ray, look for a 3rd-party tool
answer both macro and micro level questions in just a few clicks!
Support async event sources such as Kinesis, DynamoDB streams and SNS
Support TCP traffic, e.g. RDS and ElastiCache
trace 500K invocations per month for FREE with promo code Yan500 platform.lumigo.io/signup
How to mitigate slow dependencies?
it depends…
can you use another service?
if not, a good caching strategy often helps
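As one possible caching strategy, a function can keep results from a slow dependency in memory for a short time. This is a minimal sketch under assumptions: `createTtlCache` and `getConfig` are hypothetical names, the 60 second TTL is arbitrary, and `loadFromDependency` stands in for whatever slow call is being cached.

```javascript
// Small in-memory cache with a time-to-live. `now` is injectable for testing.
function createTtlCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry) return undefined;
      if (now() >= entry.expiresAt) {
        entries.delete(key); // expired: drop it and report a miss
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    },
  };
}

// Declared outside the handler so it survives across invocations
// served by the same worker instance.
const cache = createTtlCache(60000);

// `loadFromDependency` is a placeholder for the slow call.
async function getConfig(loadFromDependency) {
  const cached = cache.get('config');
  if (cached !== undefined) return cached;
  const fresh = await loadFromDependency();
  cache.set('config', fresh);
  return fresh;
}
```

Each worker instance has its own copy of the cache, so a cold start still pays for one call to the dependency, and stale data can be served for up to the TTL.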
bit.ly/3h7Bo41
Diagram: Client, Server 1, Server 2 (50ms later)
tuning required for each service
helps in some cases but can exacerbate the problem in other cases
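The client, two servers and the 50ms delay appear to describe request hedging: if the first server has not answered in time, the same request is sent to a second server and the first successful answer wins. That reading is an assumption, and so is everything in the sketch below; `hedgedRequest` and `after` are hypothetical names.

```javascript
// Calls `primary`; if it has not settled after `hedgeAfterMs`, also calls
// `backup` and resolves with whichever attempt succeeds first.
function hedgedRequest(primary, backup, hedgeAfterMs) {
  return new Promise((resolve, reject) => {
    let failures = 0;
    let backupStarted = false;

    const startBackup = () => {
      if (backupStarted) return;
      backupStarted = true;
      backup().then(resolve, onFailure);
    };

    const onFailure = (err) => {
      failures += 1;
      if (failures === 2) {
        reject(err); // both attempts failed
      } else {
        clearTimeout(timer);
        startBackup(); // one attempt failed: make sure the other is running
      }
    };

    const timer = setTimeout(startBackup, hedgeAfterMs);
    primary().then((value) => {
      clearTimeout(timer);
      resolve(value);
    }, onFailure);
  });
}

// Stand-ins for two servers with different response times.
const after = (ms, value) => () =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

hedgedRequest(after(200, 'server 1'), after(20, 'server 2'), 50).then((winner) =>
  console.log('answered by', winner)
);
```

The hedge delay is the part that needs tuning per service: set it too low and every slow spell sends extra traffic to servers that may already be struggling.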
can you use another service?
@theburningmonk theburningmonk.com github.com/theburningmonk yan@lumigo.io
