124

We often upload files into a shared S3 bucket, which makes it hard to figure out what data is in it.

How can I view objects uploaded on a particular date?

12 Answers

94

One solution would probably be to use the s3api. It works easily if you have fewer than 1000 objects; otherwise you need to work with pagination.

s3api can list all objects and exposes the LastModified attribute of keys stored in S3. The results can then be sorted, or filtered to find files after or before a date, or matching a given date.

Examples of running such queries:

1. All files for a given date:

DATE=$(date +%Y-%m-%d)
bucket=test-bucket-fh
aws s3api list-objects-v2 --bucket "$bucket" \
    --query 'Contents[?contains(LastModified, `'"$DATE"'`)]'

2. All files after a certain date:

SINCE=`date --date '-2 weeks +2 days' +%F 2>/dev/null || date -v '-2w' -v '+2d' +%F`
#      ^^^^ GNU style ^^^^                               ^^^^ BSD style ^^^^
bucket=test-bucket-fh
aws s3api list-objects-v2 --bucket "$bucket" \
    --query 'Contents[?LastModified > `'"$SINCE"'`]'

s3api returns several metadata fields, so you can also filter for specific elements:

DATE=$(date +%Y-%m-%d)
bucket=test-bucket-fh
aws s3api list-objects-v2 --bucket "$bucket" \
    --query 'Contents[?contains(LastModified, `'"$DATE"'`)].Key'

11 Comments

This should be the chosen answer. The other answers suggest listing the objects in the bucket and then sorting, which will be expensive.
This is very promising but when I run the export command in bash I get the error message date: invalid option -- 'v'
@NicholasPorter awscli's --query option is implemented on the client side, no? I fail to see how this approach would be any cheaper than the other approaches suggested.
@Cerin Yes, it was tried, and I believe 70+ users acknowledge it is working. Your query Contents[?LastModified > ] does not contain anything to compare against, so yes, it produces an error. You can use a plain text value to test: Contents[?LastModified > 2022-01-01]
@WarrenParad Of course it's a client-side filter; the s3api subcommand doesn't have any server-side --filter options like other AWS APIs do. And it "works" just fine. The OP didn't require a server-side solution. Alternatives like AWS Inventory take hours to process, which is not what the OP requested.
43

Search on a given date

aws s3api list-objects-v2 --bucket BUCKET_NAME --query 'Contents[?contains(LastModified, `YYYY-MM-DD`)].Key' 

Search from a certain date to today

aws s3api list-objects-v2 --bucket BUCKET_NAME --query 'Contents[?LastModified>=`YYYY-MM-DD`].Key' 

If you want to query a specific 'folder' in the bucket, you use the --prefix option. Eg. --prefix "folder/subfolder"

You can optionally remove the .Key from the end of the query to grab all metadata fields from the s3 objects

Note that if you want to include a time, it has to have a literal T in between the date and time components, i.e. 2024-01-01T12:00:00, not a space, or it will silently fail to find anything as of October 2024.
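
For instance, extending the search-from-a-date query above with a full timestamp (BUCKET_NAME is still a placeholder):

aws s3api list-objects-v2 --bucket BUCKET_NAME --query 'Contents[?LastModified>=`2024-01-01T12:00:00`].Key'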

9 Comments

This is the best answer. It's simple and it works
Note if you want to query a specific 'folder' in the bucket, you use the --prefix option. Eg. --prefix "folder/subfolder"
you saved my day buddy 🙌
I really wish people would stop posting this same incorrect answer. Why would you even put backticks around the date? You're trying to execute the date? That's clearly invalid syntax.
@Cerin It actually is valid syntax; the backticks are in a string that's surrounded by single quotes, so they're treated as literal backticks as far as the shell is concerned. To the AWS CLI, the backticks are used to identify strings in a query clause (docs).
14

In case it helps anyone in the future, here's a python program that will allow you to filter by a set of prefixes, suffixes, and/or last modified date. Note that you'll need aws credentials set up properly in order to use boto3. Note that this supports prefixes that contain more than 1000 keys.

Usage:

python save_keys_to_file.py -b 'bucket_name' -p some/prefix -s '.txt' '.TXT' -f '/Path/To/Some/File/test_keys.txt' -n '2018-1-1' -x '2018-2-1' 

Code filename: save_keys_to_file.py:

import argparse
import boto3
import dateutil.parser
import logging
import pytz

from collections import namedtuple

logger = logging.getLogger(__name__)

Rule = namedtuple('Rule', ['has_min', 'has_max'])

last_modified_rules = {
    Rule(has_min=True, has_max=True): lambda min_date, date, max_date: min_date <= date <= max_date,
    Rule(has_min=True, has_max=False): lambda min_date, date, max_date: min_date <= date,
    Rule(has_min=False, has_max=True): lambda min_date, date, max_date: date <= max_date,
    Rule(has_min=False, has_max=False): lambda min_date, date, max_date: True,
}


def get_s3_objects(bucket, prefixes=None, suffixes=None, last_modified_min=None, last_modified_max=None):
    """
    Generate the objects in an S3 bucket. Adapted from:
    https://alexwlchan.net/2017/07/listing-s3-keys/

    :param bucket: Name of the S3 bucket.
    :ptype bucket: str
    :param prefixes: Only fetch keys that start with these prefixes (optional).
    :ptype prefixes: tuple
    :param suffixes: Only fetch keys that end with these suffixes (optional).
    :ptype suffixes: tuple
    :param last_modified_min: Only yield objects with LastModified dates greater than this value (optional).
    :ptype last_modified_min: datetime.date
    :param last_modified_max: Only yield objects with LastModified dates less than this value (optional).
    :ptype last_modified_max: datetime.date

    :returns: generator of dictionary objects
    :rtype: dict
    https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects
    """
    if last_modified_min and last_modified_max and last_modified_max < last_modified_min:
        raise ValueError(
            "When using both, last_modified_max: {} must be greater than last_modified_min: {}".format(
                last_modified_max, last_modified_min
            )
        )
    # Use the last_modified_rules dict to look up which conditional logic to apply
    # based on which arguments were supplied
    last_modified_rule = last_modified_rules[bool(last_modified_min), bool(last_modified_max)]

    if not prefixes:
        prefixes = ('',)
    else:
        prefixes = tuple(set(prefixes))
    if not suffixes:
        suffixes = ('',)
    else:
        suffixes = tuple(set(suffixes))

    s3 = boto3.client('s3')
    kwargs = {'Bucket': bucket}

    for prefix in prefixes:
        kwargs['Prefix'] = prefix

        while True:
            # The S3 API response is a large blob of metadata.
            # 'Contents' contains information about the listed objects.
            resp = s3.list_objects_v2(**kwargs)
            for content in resp.get('Contents', []):
                last_modified_date = content['LastModified']
                if (
                    content['Key'].endswith(suffixes) and
                    last_modified_rule(last_modified_min, last_modified_date, last_modified_max)
                ):
                    yield content

            # The S3 API is paginated, returning up to 1000 keys at a time.
            # Pass the continuation token into the next response, until we
            # reach the final page (when this field is missing).
            try:
                kwargs['ContinuationToken'] = resp['NextContinuationToken']
            except KeyError:
                break


def get_s3_keys(bucket, prefixes=None, suffixes=None, last_modified_min=None, last_modified_max=None):
    """
    Generate the keys in an S3 bucket.

    :param bucket: Name of the S3 bucket.
    :ptype bucket: str
    :param prefixes: Only fetch keys that start with these prefixes (optional).
    :ptype prefixes: tuple
    :param suffixes: Only fetch keys that end with these suffixes (optional).
    :ptype suffixes: tuple
    :param last_modified_min: Only yield objects with LastModified dates greater than this value (optional).
    :ptype last_modified_min: datetime.date
    :param last_modified_max: Only yield objects with LastModified dates less than this value (optional).
    :ptype last_modified_max: datetime.date
    """
    for obj in get_s3_objects(bucket, prefixes, suffixes, last_modified_min, last_modified_max):
        yield obj['Key']


def valid_datetime(date):
    if date is None:
        return date
    try:
        utc = pytz.UTC
        return utc.localize(dateutil.parser.parse(date))
    except Exception:
        raise argparse.ArgumentTypeError("Could not parse value: '{}' to type datetime".format(date))


def main():
    FORMAT = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    logging.basicConfig(format=FORMAT)
    logger.setLevel(logging.DEBUG)

    parser = argparse.ArgumentParser(description='List keys in S3 bucket for prefix')
    parser.add_argument('-b', '--bucket', help='S3 Bucket')
    parser.add_argument('-p', '--prefixes', nargs='+', help='Filter s3 keys by a set of prefixes')
    parser.add_argument('-s', '--suffixes', nargs='*', help='Filter s3 keys by a set of suffixes')
    parser.add_argument('-n', '--last_modified_min', default=None, type=valid_datetime,
                        help='Filter s3 content by minimum last modified date')
    parser.add_argument('-x', '--last_modified_max', default=None, type=valid_datetime,
                        help='Filter s3 content by maximum last modified date')
    parser.add_argument('-f', '--file', help='Optional: file to write keys to.', default=None)

    args = parser.parse_args()
    logger.info(args)

    keys = get_s3_keys(args.bucket, args.prefixes, args.suffixes, args.last_modified_min, args.last_modified_max)

    open_file = open(args.file, 'w') if args.file else None
    try:
        counter = 0
        for key in keys:
            print(key, file=open_file)
            counter += 1
    finally:
        if open_file:
            open_file.close()

    logger.info('Retrieved {} keys'.format(counter))


if __name__ == '__main__':
    main()

Comments

11

BTW this works on Windows if you want to search between dates

aws s3api list-objects-v2 --max-items 10 --bucket "BUCKET" --query "Contents[?LastModified>='2019-10-01 00:00:00'] | [?LastModified<='2019-10-30 00:00:00'].{ Key: Key, Size: Size, LastModified: LastModified }"

1 Comment

Using this method, I had to add the --prefix flag to get into nested directories. Question: How do I remove the prefix from the key? I want only the file name alone instead of "Key": "parent-dir1/parent-dir2/file_name".
11

If you have massive amounts of files (millions or billions of entries), the best way to go is to generate a bucket inventory using Amazon S3 Inventory, including the Last Modified field, and then query the generated inventory via Amazon Athena using SQL queries.

You can find a detailed walkthrough here: https://aws.amazon.com/blogs/storage/manage-and-analyze-your-data-at-scale-using-amazon-s3-inventory-and-amazon-athena/
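
As a rough sketch of the final Athena step (the database, table, and column names below are assumptions; take the actual ones from your own inventory setup, and note that last_modified_date may be a string rather than a timestamp depending on the inventory output format), the query could be run from the CLI like this:

# Hypothetical names: s3_inventory_db, my_inventory_table, and the results bucket
aws athena start-query-execution \
    --query-execution-context Database=s3_inventory_db \
    --result-configuration OutputLocation=s3://my-athena-results-bucket/ \
    --query-string "SELECT key, last_modified_date FROM my_inventory_table WHERE last_modified_date >= TIMESTAMP '2019-10-01 00:00:00'"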

3 Comments

Finally, an answer that actually works at scale. While it might not be the easiest to manage, it is 100x better than the other "solutions" to this question. Please listen to Guillermo.
This is what I was looking for… except it can take up to 48 hours. My forensics problem can't take that long, unfortunately. I've set it up, though, for future use.
One challenge here is that it only updates daily, and can't be triggered immediately
10

This isn't a general solution, but can be helpful where your objects are named based on date - such as CloudTrail logs. For example, I wanted a list of objects created in June 2019.

aws s3api list-objects-v2 --bucket bucketname --prefix path/2019-06 

This does the filtering on the server side. The downside of using the "query" parameter is it downloads a lot of data to filter on the client side. This means potentially a lot of API calls, which cost money, and additional data egress from AWS that you pay for.

Source: https://github.com/aws/aws-sdk-js/issues/2543
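
If you also need a date cut-off within that prefix, the server-side prefix filter can be combined with the client-side query shown in the other answers; a hypothetical example (bucket, path, and date are placeholders):

aws s3api list-objects-v2 --bucket bucketname --prefix path/2019-06 --query 'Contents[?LastModified>=`2019-06-15`].Key'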

Comments

9

The following command works in Linux.

aws s3 ls --recursive s3://<your-s3-path-here> | awk '$1 > "2018-10-13 00:00:00" {print $0}' | sort -n
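
To restrict the listing to a date range rather than everything after a date, the same awk string comparison can be extended (the dates are just illustrative):

aws s3 ls --recursive s3://<your-s3-path-here> | awk '$1 >= "2018-10-01" && $1 <= "2018-10-13" {print $0}' | sort -n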

I hope this helps!!!

1 Comment

Beware this approach will be expensive if you have a lot of objects in the bucket. You still get charged for listing objects. Check out @frédéric-henri answer below.
8

Looks like there is no API that lets you filter by modified date on the server-side. All the filtering seems to be happening on the client-side, so no matter what client you use (s3api, boto3, etc) it will be slow if you have to do that on a large number of files. There is no good option to parallelize that scan unless you can do that by running a list operation on different subfolders, but that's definitely not going to work for a lot of cases.

The only option I've found that actually gives you the ability to filter a large number of files by modified date is to use AWS S3 inventory - https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html. This way AWS runs the S3 files indexing for you and stores the files' metadata (i.e. file path, last modified date, size, etc) in a specified S3 location. You can easily use that to filter by modified date.

Comments

4

There are 2 places where the sorting can happen: server side or client side.


Client Side

This is what you would do when using the aws cli with the --query option. It fetches the list of all files in your bucket in groups of 1000 keys. Then, once all keys are pulled, your client orders them by date. This is extremely slow on buckets with a high number of files. The CLI uses what is referred to as pagination in the various AWS SDKs.

aws s3api list-objects --bucket BUCKET_NAME --query 'sort_by(Contents, &LastModified)[]' 
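
As a variation on the same client-side approach, JMESPath also supports reverse() and slicing, so you could keep only the N most recently modified keys; for example, the 10 newest (BUCKET_NAME is a placeholder):

aws s3api list-objects --bucket BUCKET_NAME --query 'reverse(sort_by(Contents, &LastModified))[:10].{Key: Key, LastModified: LastModified}'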

Server Side

Without extra infrastructure it's not possible to do it. @Guillermo Gutiérrez explained it perfectly.

If you have massive amounts of files (millions or billions of entries), the best way to go is to generate a bucket inventory using Amazon S3 Inventory, including the Last Modified field, and then query the generated inventory via Amazon Athena using SQL queries.

You can find a detailed walkthrough here: https://aws.amazon.com/blogs/storage/manage-and-analyze-your-data-at-scale-using-amazon-s3-inventory-and-amazon-athena/

Comments

2

If versioning is enabled for the bucket and you want to restore latest deleted objects after a specific date, this is the command:

$ aws s3api list-object-versions --bucket mybucket --prefix myprefix/ --output json --query 'DeleteMarkers[?LastModified>=`2020-07-07T00:00:00` && IsLatest==`true`].[Key,VersionId]' | jq -r '.[] | "--key '\''" + .[0] + "'\'' --version-id " + .[1]' |xargs -L1 aws s3api delete-object --bucket mybucket 

This requires that you have the AWS CLI (I used v2.0.30) and jq installed.


If you want to be sure before deleting that everything is ok, just use echo before aws:

$ aws s3api list-object-versions --bucket mybucket --prefix myprefix/ --output json --query 'DeleteMarkers[?LastModified>=`2020-07-07T00:00:00` && IsLatest==`true`].[Key,VersionId]' | jq -r '.[] | "--key '\''" + .[0] + "'\'' --version-id " + .[1]' |xargs -L1 echo aws s3api delete-object --bucket mybucket > files.txt 

Note that because of echo, the quotes will not be applied correctly and will be saved to the file without them. That's fine as long as there are no spaces in the paths. You can check that file and, if everything looks good, run it like this:

$ cat files.txt | bash 

Comments

1

If you want to list objects last modified after a specific date (optionally under a prefix), this is the command:

$ aws s3api list-objects-v2 --bucket "bucket_name" --prefix "prefix" --query "Contents[?LastModified>='2023-01-23'].{key: Key, date: LastModified}" 

Comments

0

When you are not sure about the start date and just need the 10 oldest objects:

aws s3api list-objects --bucket your-bucket-name --prefix your-prefix --query "Contents[?LastModified!=null].{Key: Key, LastModified: LastModified}" --output json | grep -v '"LastModified": null' | jq 'sort_by(.LastModified) | .[:10]'

Comments
