3

I have a REST API in AWS API Gateway that invokes a Python Lambda function and returns the result. Most of the time this works fine: the Lambda function runs and passes its result back to the API, which in turn returns a 200 OK response.

However, occasionally I get a 500 error code from the API, and the Lambda does not even seem to be executed. The response.reason is "Internal Server Error" and no additional information is given.

There is no difference between the failing requests and the successful ones to the API in terms of the method or parameters format.

One more detail: the API has caching enabled. I've seen similar posts; some answers point to the format of the JSON object returned by the Lambda function and others to IAM permission issues, but neither seems to be the cause here. As the title says, the behavior is intermittent: most of the time it works fine, but occasionally I get this error.

Any hint would be highly appreciated.

4
  • You can enable logging on API Gateway and check the logs; that should give you some idea about the issue. Commented Apr 10, 2021 at 4:26
  • @PankajYadav In fact I did so, I enabled both CloudWatch Logs and Access Logging, but none of them provided additional information. Surprisingly, the log entries that correspond to the API request that caused the error don't even look like an error. Commented Apr 10, 2021 at 4:57
  • You are using exception handling inside your Lambda function, right? Commented May 6, 2021 at 9:09
  • Exactly, that's how I noticed the error codes. In fact, since my first post I've received some additional errors such as ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), <Response [502]> and <Response [503]>, among others. Commented May 7, 2021 at 13:08

3 Answers 3

11

I had the same problem. In my case I had to enable "Log full requests/responses data" together with INFO-level logging on the API Gateway stage to see the following log entry:

(xxx) Endpoint response body before transformations: { "Type": "Service", "message": "INFO: Lambda is initializing your function. It will be ready to invoke shortly." } 

In my case the issue was that the Lambda was in the Inactive state, which happens if a function remains idle for several weeks.
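A function's state can also be checked directly, which may be quicker than digging through stage logs. The following is a minimal sketch, assuming a client that behaves like boto3.client("lambda"); the helper names `lambda_state` and `is_ready` are made up for illustration. It relies on `get_function_configuration`, whose response carries a `State` field (Pending, Active, Inactive or Failed) and a `StateReason`.

```python
def lambda_state(lambda_client, function_name):
    """Return (state, reason) for a Lambda function.

    `lambda_client` is assumed to behave like boto3.client("lambda").
    """
    cfg = lambda_client.get_function_configuration(FunctionName=function_name)
    # State is one of Pending, Active, Inactive or Failed.
    state = cfg.get("State", "Unknown")
    reason = cfg.get("StateReason", "")
    return state, reason


def is_ready(lambda_client, function_name):
    """True only when the function reports the Active state."""
    state, _ = lambda_state(lambda_client, function_name)
    return state == "Active"
```

With boto3 installed and credentials configured, this would be called as `is_ready(boto3.client("lambda"), "my-function")`, where "my-function" is a placeholder name. Checking the state does not by itself wake an Inactive function; as far as I know, an invocation is what triggers the transition back to Active.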


2 Comments

What's the recommended way to handle this in code that invokes a Lambda? It's a bummer that AWS treats this as an "error" that client libraries like boto3 will throw exceptions for, even though the Lambda will run eventually. I almost want to say 202 makes more sense than 500. Even 503 would be better, but since the Lambda does eventually run, 5xx doesn't make sense to me. Anyway, I guess I would wrap the Lambda invocation code in a try-catch.
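One possible way to handle this on the client side is to retry transient 5xx responses with exponential backoff. This is only a sketch, not an official AWS recommendation: `call_with_retry`, `RETRYABLE` and the default values are assumptions chosen for illustration, and `send` stands for any zero-argument callable that returns an object with a `status_code` attribute (a requests.Response, for instance).

```python
import random
import time

# Status codes treated as transient in this sketch (an assumption).
RETRYABLE = {500, 502, 503, 504}


def call_with_retry(send, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    Raises the last error if every attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            response = send()
            if response.status_code not in RETRYABLE:
                return response
            last_error = RuntimeError(f"HTTP {response.status_code}")
        except OSError as exc:
            # requests.exceptions.ConnectionError and the built-in
            # ConnectionResetError both derive from OSError.
            last_error = exc
        if attempt < attempts - 1:
            # Exponential backoff plus a little random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise last_error
```

With requests, a call might look like `call_with_retry(lambda: requests.get(url, timeout=10))`, where `url` is your API endpoint. Blind retries are only safe when the request is idempotent.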
"As stated in the announcement post, Lambda precreates the ENIs required for your function to connect to your VPCs, which can take 60 to 90 seconds to complete. We will be changing this process slightly, by creating the required ENI resources while the function is placed in a Pending state and transitioning to Active after that process is completed. ". In case of VPC there is not much you can do, apparently... @jspinella
0

I have the same problem, and I suspect a timeout, possibly caused by the Lambda reaching its memory limit.

I raised the memory limit (128 MB -> 512 MB) and increased the timeout to 10 s (the default is 3 s), and now I can see the timeout in action. I still have the problem for the moment, but now I'll be able to investigate.
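The same change can be scripted instead of made in the console. A minimal sketch, assuming a client that behaves like boto3.client("lambda"); `raise_limits` is a hypothetical helper name, and the defaults simply mirror the values mentioned above.

```python
def raise_limits(lambda_client, function_name, memory_mb=512, timeout_s=10):
    """Set the function's memory size (MB) and timeout (seconds).

    Returns the (MemorySize, Timeout) pair reported back by the service.
    """
    response = lambda_client.update_function_configuration(
        FunctionName=function_name,
        MemorySize=memory_mb,
        Timeout=timeout_s,
    )
    return response.get("MemorySize"), response.get("Timeout")
```

With boto3 this would be called as `raise_limits(boto3.client("lambda"), "my-function")`, where "my-function" is a placeholder.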

I hope that this helps you.


0

I see this with an HTTP API integration. It's intermittent, and it appears to improve when provisioned concurrency is added to the Lambda. For example, on a Lambda that has between 4 and 10 concurrent instances, but usually hovers in the 4 to 8 range, purchasing 5 or 6 provisioned concurrent instances helped reduce, and possibly eliminate, these 500 errors.

I am still monitoring to see whether they are gone for good. The frequency of these errors has gone down drastically with the provisioned instances.
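For reference, provisioned concurrency can be configured through the API as well. This is a rough sketch, assuming a client that behaves like boto3.client("lambda"); `set_provisioned_concurrency` is a made-up helper name. Provisioned concurrency is attached to a published version or an alias rather than to $LATEST, so the API integration has to target that version or alias to benefit.

```python
def set_provisioned_concurrency(lambda_client, function_name, qualifier, instances):
    """Request `instances` provisioned concurrent executions for a version or alias."""
    if qualifier == "$LATEST":
        # Provisioned concurrency needs a published version or an alias.
        raise ValueError("qualifier must be a published version or an alias")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=qualifier,
        ProvisionedConcurrentExecutions=instances,
    )
```

A call such as `set_provisioned_concurrency(boto3.client("lambda"), "my-function", "live", 5)` would request five instances for a hypothetical alias named "live".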
