6

I have two FIFO SQS queues that receive JSON messages to be indexed into Elasticsearch. One queue constantly receives delta changes made to the database. The second queue is used for database re-indexing, i.e. the entire 50 TB of data is re-indexed every couple of months (with everything added to the queue). I have a Lambda function that consumes the messages from the queues and places them into the appropriate index (either the active index or the index being rebuilt).

How should I trigger the Lambda function to best process the backlog of messages in SQS, so that it processes both queues as quickly as possible?

A constraint I have is that the queue items need to be processed in order. If the Lambda function could run indefinitely, without the 5-minute limit, I could keep running one function that constantly processes messages.

2
  • Am I understanding correctly: you have a few million jobs every few months, and you want to run the jobs serially, so no parallelism, correct? Commented Feb 23, 2018 at 21:02
  • I just updated the question with additional details on what the queues are used for and how the process works. Commented Feb 23, 2018 at 21:21

3 Answers

1

Instead of pushing your messages directly into SQS, you could publish them to an SNS topic with two subscribers registered:

  1. Subscriber: SQS
  2. Subscriber: Lambda Function

This has the benefit that your Lambda is invoked at the same time as the message is stored in SQS.


1 Comment

I'd rather not add an additional layer to this if possible, as it will increase the complexity and cost.
1

The standard way to do this is to use a CloudWatch Events rule that invokes the function periodically. This lets you pull data from the queue on a regular schedule.

Because you have to poll SQS this may not lead to the fastest processing of messages. Also, be careful if you constantly have messages to process - Lambda will end up being far more expensive than a small EC2 instance to handle the messages.
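As a rough illustration of the scheduled-polling approach, here is a minimal sketch of a loop that a scheduled Lambda could run to drain a queue in order, stopping shortly before its time limit. The names `drain_queue`, `index_doc` and `safety_ms` are hypothetical; `sqs` is assumed to be a boto3 SQS client, and `remaining_ms` is assumed to be something like `context.get_remaining_time_in_millis`.

```python
import json


def drain_queue(sqs, queue_url, index_doc, remaining_ms, safety_ms=10_000):
    """Process messages one by one until the queue is empty or time runs low.

    sqs          -- a boto3 SQS client (or any object with the same two methods)
    index_doc    -- hypothetical callback that writes one document to Elasticsearch
    remaining_ms -- callable returning the milliseconds left for this invocation
    """
    processed = 0
    while remaining_ms() > safety_ms:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
            WaitTimeSeconds=1,       # short long-poll before giving up
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # backlog drained
        for msg in messages:
            index_doc(json.loads(msg["Body"]))
            # Delete only after the document was indexed successfully.
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
            processed += 1
    return processed
```

Inside a handler this might be called as `drain_queue(boto3.client("sqs"), queue_url, index_doc, context.get_remaining_time_in_millis)`. Because a message is deleted only after `index_doc` returns, a failure leaves it on the queue to be retried once its visibility timeout expires.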

6 Comments

Periodically running the Lambda function won't work for me, as I'll be re-indexing a massive DB (hundreds of millions of documents), so I can't afford to not have messages being processed (i.e. the time between one Lambda ending and the next one beginning).
@CorribView why do you want to use Lambda? Wouldn't EC2 be a better match, as it seems you only want one concurrent worker anyway and it would need to run constantly?
I'd have to agree with @hansaplast - Lambda may not be the best choice. If you want to minimize the maintenance you could use the Elastic Beanstalk Worker environment that would allow scalability and would be nearly real time. Additionally you could tweak the size of the instances if the throughput isn't what you want. But, of course, you could just have an EC2 to handle it too.
It will only need to run constantly on occasion (once every 2 months), while the index is being rebuilt. We're also working on moving away from EC2 instances in the medium term and re-engineering our design to be serverless with microservices.
@CorribView - so spin up an EC2, have it do what is needed, and shut it down. You are barely charged when an EC2 isn't running (depending on how much EBS you use) and it will ultimately be more timely and cost efficient. Serverless doesn't handle every use case in my opinion.
0

Not sure I fully understand your problem, but here are my 2 cents:

  1. If you have a constant, real-time stream of data, consider using Kinesis Streams with 1 shard in order to preserve the FIFO ordering. You can consume the data in batches of n items using Lambda. It is up to you to decide the batch size n and the memory size of the Lambda.

    • with this solution you pay a low constant price for Kinesis Streams and a variable price for Lambdas.
  2. If you really are in love with SQS and real-time processing does not matter, you can consume items with Lambdas, EC2 or Batch: either trigger many Lambdas with CloudWatch Events, keep an EC2 instance alive, or trigger an AWS Batch job on a regular basis.

    • there is an economic equation to explore: each solution is the best for one use case and the worst for another, so make your choice ;)
    • I prefer SQS + Lambdas when there are few items to consume and SQS + Batch when there are a lot of items to consume.
  3. You may probably also consider using SNS + SQS + Lambdas like @maikay says in his answer, but I wouldn't choose that solution.
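For option 1, the Lambda side could look roughly like the sketch below. It assumes the producer writes each document to the stream as JSON; Lambda delivers Kinesis record payloads base64-encoded under `record["kinesis"]["data"]`. The `index_doc` parameter is a hypothetical stand-in for the Elasticsearch write and defaults to `print` here.

```python
import base64
import json


def decode_batch(event):
    """Return the JSON documents of a Kinesis-triggered event, in batch order."""
    docs = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        docs.append(json.loads(payload))
    return docs


def handler(event, context, index_doc=print):
    # index_doc is a placeholder for whatever writes to Elasticsearch.
    docs = decode_batch(event)
    for doc in docs:
        index_doc(doc)
    return len(docs)
```

With a single shard and the default settings, Lambda processes one batch at a time, so the documents are handled in the order they were written to the stream.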

Hope it helps. Feel free to ask for clarifications. Good luck!
