131

I wanted to find the 10 largest files in my repository. The script I came up with is as follows:

```shell
REP_HOME_DIR=<top level git directory>
max_huge_files=10

cd ${REP_HOME_DIR}
git verify-pack -v ${REP_HOME_DIR}/.git/objects/pack/pack-*.idx | \
  grep blob | \
  sort -r -k 3 -n | \
  head -${max_huge_files} | \
  awk '{ system("printf \"%-80s \" `git rev-list --objects --all | grep " $1 " | cut -d\" \" -f2`"); printf "Size:%5d MB Size in pack file:%5d MB\n", $3/1048576, $4/1048576; }'
cd -
```

Is there a better/more elegant way to do the same?

By "files" I mean the files that have been checked into the repository.

6
  • Is this really for any directory, or is there something specific about git you are trying to figure out? By your pipes, I assume any unix command is ok? Commented Feb 26, 2012 at 20:12
  • Do you mean files being tracked or files on disk? They might not correlate in a way you expect. Commented Feb 27, 2012 at 0:48
  • Files that have been checked into the repository. Commented Mar 7, 2012 at 19:41
  • @Sumit: What version of them? If it's a binary file that's changed, you'll have both copies in the repo. Commented Mar 7, 2012 at 19:56
  • You might look at this: stackoverflow.com/questions/298314/… Commented Mar 8, 2012 at 14:47

10 Answers

172

This Bash "one-liner" displays the 10 largest blobs in the repository, sorted from smallest to largest. In contrast to the other answers, this includes all files tracked by the repository, even those not present in any branch tip.

It's very fast, easy to copy & paste and only requires standard GNU utilities.

```shell
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | sed -n 's/^blob //p' \
  | sort --numeric-sort --key=2 \
  | tail -n 10 \
  | cut -c 1-12,41- \
  | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
```

The first four lines implement the core functionality, the fifth limits the number of results, while the last two lines provide the nice human-readable output that looks like this:

```
...
0d99bb931299  530KiB path/to/some-image.jpg
2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4
```

For more information, including further filtering use cases and an output format more suitable for script processing, see my original answer to a similar question.

macOS users: Since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes or brew install coreutils.
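To see what the final formatting stages do in isolation, you can feed them a fabricated sample in the same `blob <sha1> <size> <path>` shape that `git cat-file` emits. The hashes, sizes, and paths below are made up purely for illustration:

```shell
# Fabricated cat-file output: "blob <40-char sha1> <size in bytes> <path>"
printf '%s\n' \
  'blob 0d99bb9312990000000000000000000000000000 542720 path/to/some-image.jpg' \
  'blob bd1741ddce0d0000000000000000000000000000 66060288 path/to/some-video-1080p.mp4' \
  | sed -n 's/^blob //p' \
  | sort --numeric-sort --key=2 \
  | cut -c 1-12,41- \
  | numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
# Prints the two entries smallest first, with sizes rendered as 530KiB and 63MiB.
```

The `cut -c 1-12,41-` stage keeps the first 12 characters of the 40-character hash plus everything after it, which is why the fabricated hashes above are padded to exactly 40 characters.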


8 Comments

you are not kidding, that is fast!
When I do this it displays files that are not in my branch tip. The top answer above, by contrast, does not.
@nfernand That is intended. OP says "By 'files' I mean the files that have been checked into the repository.". For me, that includes files that are not in the current branch's tip. If you need something else, go with the top answer above.
I used @raphinesse answer to identify files that caused remote: error: File foo is 114.34 MB; this exceeds GitHub's file size limit of 100.00 MB. The file foo was deleted some commits back, but its blob still existed in the history. To remove it, I had to use BFG Repo Cleaner.
And to do that, I used git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | awk '/^blob/ {print substr($0,6)}' | awk '$2 >= 100*1024^2' | sort --numeric-sort --key=2 --reverse | cut --complement --characters=13-40 | numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
108

I found another way to do it:

```shell
git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 | tail -n 10
```

Quoted from: SO: git find fat commit

4 Comments

Great answer - I'd change it a bit to git ls-tree -r -l --abbrev --full-name HEAD | sort -n -r -k 4 | head -n 10: you do not need -t as you are not interested in tree listings (which give no size), --abbrev makes the hash more readable, and head instead of tail gives the list in descending size order
Doesn't work for files that aren't checked out, or for files that existed some time in the past but were deleted but are still taking up space in a blob somewhere. See raphinesse's answer for a better solution.
This answer does work for files that aren't checked out - it processes the files in the given commit (HEAD in the example) as they appear in that commit, regardless of whether the commit is currently checked out.
This absolutely does not work for me; the @raphinesse answer is the one. This listed a 167K file as the largest while raphinesse listed a 75M file... and did it with human-readable values.
37

Use

```shell
git ls-files | xargs ls -l | sort -nrk5 | head -n 10
```
  • git ls-files: List all the files in the repo
  • xargs ls -l: perform ls -l on all the files returned in git ls-files
  • sort -nrk5: Numerically reverse sort the lines based on 5th column
  • head -n 10: Print the top 10 lines
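As a quick sanity check of the sort stage, here it is over fabricated `ls -l` style lines (the names and sizes are invented); field 5 is the byte size, so the largest file comes out first:

```shell
# Three made-up ls -l lines; column 5 is the size in bytes
printf '%s\n' \
  '-rw-r--r-- 1 user group 1048576 Jan  1 00:00 big.bin' \
  '-rw-r--r-- 1 user group     512 Jan  1 00:00 small.txt' \
  '-rw-r--r-- 1 user group   20480 Jan  1 00:00 mid.dat' \
  | sort -nrk5 | head -n 10
# big.bin (1048576 bytes) sorts first, small.txt (512 bytes) last
```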

6 Comments

This is based on the files in the current checkout. If a 10 MB file was committed and later overwritten by a 1 KB version, this command won't list it.
This answer fails when you have spaces in filenames/folders (although you do get some output). This minor change fixes that problem: git ls-files -z | xargs -0 ls -l | sort -nrk5 | head -n 10
I would use a more simplified form: git ls-files -z | xargs -0 ls -l -h -S -r. This should get you a list of all files within the repo ordered from smallest to largest with human readable sizes. If you want to truncate the list, you can use head or tail to help. My 5 cents worth...
Had to use this on my mac: git ls-files | xargs -0 | tr '\n' '\0' | xargs -0 du -h | sort -rh
Note, ls has an S option that sorts by file size already! So you can use git ls-files | xargs ls -lhS |head -n 10 and have a nice, human-readable output!
15

I cannot comment. ypid's answer modified for PowerShell:

```powershell
git ls-tree -r -l --abbrev --full-name HEAD |
  Sort-Object {[int]($_ -split "\s+")[3]} |
  Select-Object -last 10
```

A modified version of raphinesse's solution(ish):

```powershell
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  Where-Object {$_ -like "blob*"} |
  Sort-Object {[int]($_ -split "\s+")[2]} |
  Select-Object -last 10
```

3 Comments

Doesn't work for files that aren't checked out, or for files that existed some time in the past but were deleted but are still taking up space in a blob somewhere. See raphinesse's answer for a better solution.
Why was this downvoted? It is FINALLY an answer that works out of the box on Windows.
This is the only one that worked for me on Windows without the PowerShell complaining
6

An improvement to raphinesse's answer, sort by size with largest first:

```shell
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | awk '/^blob/ {print substr($0,6)}' \
  | sort --numeric-sort --key=2 --reverse \
  | head \
  | cut --complement --characters=13-40 \
  | numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
```

3 Comments

I would hardly call that an improvement. It's more a matter of preference. Plus, if someone just wanted to reverse the line order, they could simply pipe the output of my original script through tac.
@raphinesse For me, this does not simply sort raphinesse's answer in a different order. It also lists a different set of files (in particular, it lists a large file that raphinesse's answer does not list). Is this intended?
It is not intended; I really just changed the sort order to largest to smallest. Try running just the first 4 lines (to the end of the sort command) of each command and see if there's a difference between the outputs.
6

On Windows, I started with AdamF's answer (thanks!) and modified it to handle files with spaces in the path, and also to output objects instead of strings:

```powershell
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype)|%(objectname)|%(objectsize)|%(rest)' |
  Where-Object {$_ -like "blob*"} |
  % { $tokens = $_ -split "\|"; [pscustomobject]@{ Hash = $tokens[1]; Size = [int]($tokens[2]); Name = $tokens[3] } } |
  Sort-Object -Property Size -Descending |
  Select-Object -First 50
```

Even better, if you want to output the file sizes with nice file size units, you can add the DisplayInBytes function from here to your environment, and then pipe the above to:

```powershell
Format-Table Hash, Name, @{Name="Size";Expression={ DisplayInBytes($_.Size) }}
```

This gives you output like:

```
Hash                                     Name                 Size
----                                     ----                 ----
f51371aa843279a1efe45ff14f3dc3ec5f6b2322 types/react-snackbar 95.8 MB
84f3d727f6b8f99ab4698da51f9e507ae4cd8879 .ntvs_analysis.dat   94.5 MB
17d734397dcd35fdbd715d29ef35860ecade88cd fhir/fhir-tests.ts   11.5 KB
4c6a027cdbce093fd6ae15e65576cc8d81cec46c fhir/fhir-tests.ts   11.4 KB
```

Lastly, if you'd like to get all the largest file types, you can do so with:

```powershell
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype)|%(objectname)|%(objectsize)|%(rest)' |
  Where-Object {$_ -like "blob*"} |
  % { $tokens = $_ -split "\|"; [pscustomobject]@{ Size = [int]($tokens[2]); Extension = [System.IO.Path]::GetExtension($tokens[3]) } } |
  Group-Object -Property Extension |
  % { [pscustomobject]@{ Name = $_.Name; Size = ($_.Group | Measure-Object Size -Sum).Sum } } |
  Sort-Object -Property Size -Descending |
  select -First 20 -Property Name, @{Name="Size";Expression={ DisplayInBytes($_.Size) }}
```


3

Adding my two cents on how to do this for the whole repository history (useful before BFGing out the large blobs committed by accident):

```shell
git rev-list --all \
  | while read rev ; do git ls-tree -rl --full-name $rev ; done \
  | sort -k4 -nr \
  | uniq
```

Example output (from the dte repository from GitHub) reveals that there's one screenshot in history that could probably be removed to keep the whole repository a bit smaller:

```
100644 blob 3147cb8d0780442f70765a005f1a114442f24e9b 67942 Documentation/screenshot.png
100644 blob 36ea7701a6d58185800e22c39cac78d979f4375a 62575 Documentation/screenshot.png
100644 blob c0cd355f06a093cd762339b76f0e726edf22fca1 49046 src/command.c
100644 blob 76d20c2e4a80cd3f417d15c130ee6968e99d6d7f 48601 src/command.c
100644 blob c476fbf2fda71ebd4b337e62fb76922d18aeb1f3 48588 src/command.c
100644 blob 24465d1fab54e48817780338f8206baf47e98091 48451 src/command.c
100644 blob 74494b6020b2eff223dfaeed39bbfca414f2b359 48429 src/command.c
100644 blob fb8f13abe39ca8ff0e98aa65f95c336c9253b487 47838 src/command.c
100644 blob c2ce190eb428c3aeb12d40cf902af2a433324dee 47835 src/command.c
...
```

...but this precise repository is okay; no blobs of extreme size were found.

How to find the commits that work with the objects (adding for my own reference, haha):

```shell
git log --all --find-object=3147cb8d07
```


1

You can also use du. For example:

```shell
du -ah objects | sort -rh | head -n 10
```

du gets the size of the objects, sort -rh orders the human-readable sizes largest first, and head picks the top 10.
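One caveat with du -ah: the human-readable sizes it prints (512, 12K, 3.4M, ...) do not order correctly under plain sort -n, but GNU sort understands them with -h. A quick illustration on made-up values:

```shell
# sort -h orders human-readable size suffixes numerically
printf '3.4M\n512\n1.1G\n12K\n' | sort -h
# prints: 512, 12K, 3.4M, 1.1G (ascending)
```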

2 Comments

This only gives the object, not the name of the file. It also requires tricks to ignore the pack files.
Doesn't work for files that aren't checked out, or for files that existed some time in the past but were deleted but are still taking up space in a blob somewhere. See raphinesse's answer for a better solution.
1

For completeness, here's the method I found:

```shell
ls -lSh `git ls-files` | head
```

The optional -h prints the size in human-readable format.

Comments

-3

You can use find to find files larger than a given threshold, then pass them to git ls-files to exclude untracked files (e.g. build output):

```shell
find * -type f -size +100M -print0 | xargs -0 git ls-files
```

Adjust 100M (100 megabytes) as needed until you get results.
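To sanity-check the find invocation without hunting for a real large file, you can create sparse files of a made-up size in a scratch directory (truncate allocates no actual disk space):

```shell
dir=$(mktemp -d)                    # scratch directory
truncate -s 150M "$dir/big.bin"     # sparse 150 MiB file
truncate -s 10M  "$dir/small.bin"   # sparse 10 MiB file
find "$dir" -type f -size +100M     # prints only big.bin
rm -rf "$dir"
```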

Minor caveat: this won't search top-level "hidden" files and folders (i.e. those whose names start with .). This is because I used find * instead of just find to avoid searching the .git database.

I was having trouble getting the sort -n solutions to work (on Windows under Git Bash). I'm guessing it's due to xargs batching its arguments into multiple invocations (which xargs -0 seems to do automatically to work around Windows' command-line length limit of 32,767 characters), producing inconsistent output between batches.

1 Comment

Doesn't work for files that aren't checked out, or for files that existed some time in the past but were deleted but are still taking up space in a blob somewhere. See raphinesse's answer for a better solution.
