
In an S3 bucket, I have thousands and thousands of files stored with names having a structure that comes down to prefix and number:

A-0001 A-0002 A-0003 B-0001 B-0002 C-0001 C-0002 C-0003 C-0004 C-0005 

New objects for a given prefix arrive with varying frequency, or sometimes not at all. Older objects may disappear.

Is there a way to efficiently query S3 for the highest number of every prefix, i.e. without listing the entire bucket? The result I want is:

A-0003 B-0002 C-0005 

The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it? So far I have only found it capable of searching within objects, but all I care about is their key names. If it can report on the contents of objects in the bucket, can't it report on the bucket itself?

I would be okay with the latest modification date per prefix, but I want to avoid having to switch to a versioned bucket with just the prefixes as names to achieve that.

  • would something like this help, stackoverflow.com/questions/45429556/… ? Commented Dec 14, 2017 at 2:31
  • @jamohenneman I'm afraid not. The latest modification date per prefix would be fine too, but I can't query for any specific date (range), because for each I want the highest number overall, and the rates of increment differ. Commented Dec 14, 2017 at 2:36
  • 1
    There is no way to do this without listing the entire bucket. Commented Dec 14, 2017 at 2:53

3 Answers


I think this is what you are looking for:

Athena exposes the S3 location of the file each row came from through the "$path" pseudo-column, and you can apply a regular expression to it to match the key pattern you are querying for:

WHERE regexp_extract(sp."$path", '[^/]+$') LIKE concat('%', cast(current_date - interval '1' day as varchar), '.csv')
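Adapted to the key scheme in the question, the same idea could look like the sketch below (Athena/Presto SQL; `sp` is assumed to be a table already defined over the bucket, and Athena still scans every file to produce the rows):

```sql
-- "$path" is Athena's pseudo-column holding the S3 URI of the source file.
-- Because the numbers are zero-padded, the lexicographic max of the file
-- name is also the numeric max.
SELECT prefix, max(file_name) AS highest_key
FROM (
    SELECT regexp_extract(sp."$path", '[^/]+$') AS file_name,
           split_part(regexp_extract(sp."$path", '[^/]+$'), '-', 1) AS prefix
    FROM sp
) t
GROUP BY prefix;
```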


1 Comment

Is this an Athena query? More context would be appreciated

The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it?

Yes, at the moment there is no direct way of doing it with Amazon S3 alone. Even with Athena, the service still has to scan the objects to query their content, but the standard SQL support makes the query easier to express, and it is faster because the queries run in parallel.

So far I have only found it capable of searching within objects, but all I care about are their key names.

Both Athena and S3 Select query object content, not key names.

The best approach I can recommend is to use AWS DynamoDB to keep the metadata of the files, including the file names, for faster querying.
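As a minimal sketch of the kind of index such a table would maintain: on every object-created event you parse the key into prefix and number and keep only the highest number per prefix. This is plain Python with the DynamoDB write stubbed out as a dict (a real setup would use a conditional UpdateItem); the key format `PREFIX-NNNN` is taken from the question.

```python
import re

# Key format from the question: "<prefix>-<zero-padded number>", e.g. "C-0005".
KEY_PATTERN = re.compile(r"^(?P<prefix>.+)-(?P<num>\d+)$")

# Stand-in for a DynamoDB table: prefix -> highest number seen so far.
index = {}

def record_key(key):
    """Update the per-prefix maximum when a new object appears."""
    m = KEY_PATTERN.match(key)
    if not m:
        return  # ignore keys that don't follow the naming scheme
    prefix, num = m.group("prefix"), int(m.group("num"))
    if num > index.get(prefix, -1):
        index[prefix] = num

def highest_per_prefix():
    """Return the highest full key per prefix, e.g. {'A': 'A-0003'}."""
    return {p: f"{p}-{n:04d}" for p, n in index.items()}

for key in ["A-0001", "A-0002", "A-0003", "B-0001", "B-0002",
            "C-0001", "C-0002", "C-0003", "C-0004", "C-0005"]:
    record_key(key)

print(highest_per_prefix())
# {'A': 'A-0003', 'B': 'B-0002', 'C': 'C-0005'}
```

Answering the question then becomes a single scan (or query) of the small metadata table rather than a listing of the whole bucket.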

3 Comments

What about a Lambda function that triggers on new file creation and copies the file to another S3 bucket? When a new version comes in, you remove the old one and copy in the new one.
I didn't get your point. Can you explain further how copying files to another bucket relates to querying?
If you have a separate bucket with the last 3 files of each prefix, you can just query that bucket and have an up-to-date list, plus the latest versions of the files to do other things with as a bonus, rather than using DynamoDB :)

Use S3 Inventory to create a list of objects and Athena to query it:

https://aws.amazon.com/blogs/storage/manage-and-analyze-your-data-at-scale-using-amazon-s3-inventory-and-amazon-athena/
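As a sketch of what that query could look like, assuming the inventory report has been set up and a table has been created over it with the standard `key` column (`my_inventory` is a name chosen for illustration):

```sql
-- "key" is the object-key column that an S3 Inventory report emits;
-- zero-padded numbers make the lexicographic max the numeric max too.
SELECT split_part(key, '-', 1) AS prefix,
       max(key) AS highest_key
FROM my_inventory
GROUP BY split_part(key, '-', 1);
```

Because the inventory report is a small listing of key names rather than the objects themselves, this avoids scanning the bucket's contents; note that inventory reports are generated on a daily or weekly schedule, so the result can lag behind the live bucket.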

