
In an S3 bucket, I have thousands and thousands of files stored with names having a structure that comes down to prefix and number:

A-0001 A-0002 A-0003 B-0001 B-0002 C-0001 C-0002 C-0003 C-0004 C-0005 

New objects for a given prefix arrive with varying frequency, or sometimes not at all. Older objects may disappear.

Is there a way to efficiently query S3 for the highest number of every prefix, i.e. without listing the entire bucket? The result I want is:

A-0003 B-0002 C-0005 

The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it? So far I have only found it capable of searching within objects, but all I care about is their key names. If it can report on the contents of objects in the bucket, can't it report on the bucket itself?

I would be okay with the latest modification date per prefix, but I want to avoid having to switch to a versioned bucket with just the prefixes as names to achieve that.

  • would something like this help, stackoverflow.com/questions/45429556/… ? Commented Dec 14, 2017 at 2:31
  • @jamohenneman I'm afraid not. The latest modification date per prefix would be fine too, but I can't query for any specific date (range), because for each I want the highest number overall, and the rates of increment differ. Commented Dec 14, 2017 at 2:36
  • 1
    There is no way to do this without listing the entire bucket. Commented Dec 14, 2017 at 2:53

3 Answers


I think this is what you are looking for:

Athena exposes the S3 location of the file each row came from through the "$path" pseudo-column, and you can apply a regular expression to it to match the key pattern you are querying for:

WHERE regexp_extract(sp."$path", '[^/]+$') LIKE concat('%', cast(current_date - interval '1' day as varchar), '.csv')
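Adapted to the key scheme in the question, the same idea could look like the sketch below (Athena/Presto SQL; `sp` is assumed to be a table already defined over the bucket, and Athena still scans every file to produce the rows):

```sql
-- "$path" is Athena's pseudo-column holding the S3 URI of the source file.
-- Because the numbers are zero-padded, the lexicographic max of the file
-- name is also the numeric max.
SELECT prefix, max(file_name) AS highest_key
FROM (
    SELECT regexp_extract(sp."$path", '[^/]+$') AS file_name,
           split_part(regexp_extract(sp."$path", '[^/]+$'), '-', 1) AS prefix
    FROM sp
) t
GROUP BY prefix;
```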


1 Comment

Is this an Athena query? More context would be appreciated

The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it?

Yes, at the moment there is no direct way of doing it with Amazon S3 alone. Even with Athena, the service still has to scan the objects to query their content, but the standard SQL support makes the query easier to express, and it is faster because the queries run in parallel.

So far I have only found it capable of searching within objects, but all I care about are their key names.

Both Athena and S3 Select query object content, not key names.

The best approach I can recommend is to use AWS DynamoDB to keep the metadata of the files, including the file names, for faster querying.
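As a minimal sketch of the kind of index such a table would maintain: on every object-created event you parse the key into prefix and number and keep only the highest number per prefix. This is plain Python with the DynamoDB write stubbed out as a dict (a real setup would use a conditional UpdateItem); the key format `PREFIX-NNNN` is taken from the question.

```python
import re

# Key format from the question: "<prefix>-<zero-padded number>", e.g. "C-0005".
KEY_PATTERN = re.compile(r"^(?P<prefix>.+)-(?P<num>\d+)$")

# Stand-in for a DynamoDB table: prefix -> highest number seen so far.
index = {}

def record_key(key):
    """Update the per-prefix maximum when a new object appears."""
    m = KEY_PATTERN.match(key)
    if not m:
        return  # ignore keys that don't follow the naming scheme
    prefix, num = m.group("prefix"), int(m.group("num"))
    if num > index.get(prefix, -1):
        index[prefix] = num

def highest_per_prefix():
    """Return the highest full key per prefix, e.g. {'A': 'A-0003'}."""
    return {p: f"{p}-{n:04d}" for p, n in index.items()}

for key in ["A-0001", "A-0002", "A-0003", "B-0001", "B-0002",
            "C-0001", "C-0002", "C-0003", "C-0004", "C-0005"]:
    record_key(key)

print(highest_per_prefix())
# {'A': 'A-0003', 'B': 'B-0002', 'C': 'C-0005'}
```

Answering the question then becomes a single scan (or query) of the small metadata table rather than a listing of the whole bucket.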

3 Comments

What about a Lambda function that triggers on new file creation and copies the file to another S3 bucket? When a new version comes in, you remove the old one and copy in the new one.
I didn't get your point. Can you explain further how copying files to another bucket relates to querying?
If you have a separate bucket with the last 3 files of each prefix, you can just query that bucket and have an up-to-date list, plus the latest versions of the files to do other things with as a bonus, rather than using DynamoDB :)

Use S3 Inventory to create a list of objects and Athena to query it:

https://aws.amazon.com/blogs/storage/manage-and-analyze-your-data-at-scale-using-amazon-s3-inventory-and-amazon-athena/
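As a sketch of what that query could look like, assuming the inventory report has been set up and a table has been created over it with the standard `key` column (`my_inventory` is a name chosen for illustration):

```sql
-- "key" is the object-key column that an S3 Inventory report emits;
-- zero-padded numbers make the lexicographic max the numeric max too.
SELECT split_part(key, '-', 1) AS prefix,
       max(key) AS highest_key
FROM my_inventory
GROUP BY split_part(key, '-', 1);
```

Because the inventory report is a small listing of key names rather than the objects themselves, this avoids scanning the bucket's contents; note that inventory reports are generated on a daily or weekly schedule, so the result can lag behind the live bucket.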

