Size of .git folder order of magnitude greater than committed files

Question

I have a folder project with a large number of files and subfolders. I have created a repository from this folder via git init to obtain the folder structure below.

project
--- .git/
--- large number of files and folders .gitignore'd
--- very few text and related files not under .gitignore
--- .gitignore

The very first (and thus far only) commit in the repository only contained a few text and related files not .gitignore'd.

The raw size of the committed files on my disk (working tree) is just a few kilobytes.

More specifically, the committed files are:

3 .tex files of total size 9 KB
4 .lyx files of total size 32 KB
1 .gitignore file of size 1 KB
3 other .txt files of total size 4 KB

Yet, at this stage, the raw size of the .git folder is 84 MB. The size of the project folder itself is around 5 GB, most of which is .gitignore'd.

Is there a way I can try to figure out what is causing this large gap between the actual committed files and the size of the .git folder?

Can you identify which subfolder/files in the .git directory are large? (Likely the objects folder.)
– TTT, Commented Apr 2, 2022 at 14:56
"the actual committed files" How do you know what files are in this commit? Did you look, or are you guessing?
– matt, Commented Apr 2, 2022 at 14:58
Seeing that you only have a few files which you want tracked, did you try running git ls-files to make sure that nothing is included that you don't want?
– sgmoore, Commented Apr 2, 2022 at 15:01
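
The checks suggested in these comments could be sketched as a small shell helper. The function name inspect_repo is hypothetical (it is not a Git command); the du and git invocations inside it are standard. It assumes the repository already has at least one commit.

```shell
# inspect_repo is a hypothetical helper, not part of Git.
# $1: path to the working tree (the folder that contains .git)
inspect_repo() {
    (
        cd "$1" || exit 1
        du -sk .git/* | sort -n          # size of each entry under .git in KiB, largest last
        git count-objects -vH            # loose vs. packed object counts and sizes
        git ls-files                     # paths currently tracked in the index
        git ls-tree -r --name-only HEAD  # paths recorded in the latest commit
    )
}

# Example call (run it against your own repository):
#   inspect_repo /path/to/project
```

If .git/objects dominates the du output while git ls-files lists only the handful of expected files, the extra space is probably taken by objects that no commit or index entry refers to.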

TTT · Accepted Answer · 2022-04-02 15:45:03Z

If you made a prior commit with many more files, and then rewrote it, the old commit is still in the repo until it is garbage collected. But I'll take your word for it that you didn't do that:

The very first (and thus far only) commit in the repository only contained a few text and related files not .gitignore'd.

Therefore, the simplest explanation for this is that you staged a large number of files before getting your .gitignore file set up properly. Even staging files without committing them takes up space in the repository, at least temporarily, because git add writes each file's content into .git/objects right away. You can easily check whether this is the cause with the prune command:

git prune -n   # dry run, show what would be removed
# and to actually do it
git prune

Then check your repo size again.
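
To see the effect in isolation, the staging scenario can be reproduced in a throwaway repository. This is only a sketch under assumptions: the file names (big.bin, notes.txt), the 1 MiB size, and the demo identity are made up for illustration and have nothing to do with the asker's project.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Stage a file that should have been ignored (1 MiB of incompressible random data).
dd if=/dev/urandom of=big.bin bs=1024 count=1024 2>/dev/null
git add big.bin                  # the blob is written to .git/objects immediately

# Fix the mistake: unstage and ignore it, then commit only the small files.
git rm -q --cached big.bin
echo 'big.bin' > .gitignore
echo 'hello' > notes.txt
git add .gitignore notes.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'first commit'

# "size:" in count-objects -v is the disk usage of loose objects, in KiB.
before=$(git count-objects -v | awk '/^size:/ {print $2}')
git prune -n                     # dry run: lists the orphaned blob
git prune                        # delete unreachable loose objects
after=$(git count-objects -v | awk '/^size:/ {print $2}')
echo "loose objects: ${before} KiB before prune, ${after} KiB after"
```

The blob for big.bin survives the unstaging because nothing deletes it at that point; it only disappears once prune (or a gc whose expiry has passed) runs.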

Side Note: under normal circumstances you don't need to run the prune command, because pruning happens automatically during garbage collection. However, by default gc only prunes unreachable objects that are older than 2 weeks. So if you wish to use gc to force a full pruning, you could use:

git gc --prune=now

Side Side Note: I always advise people to commit early and often, because if they ever really mess something up, they can traverse their reflog to find old unreachable commits to recover lost (but previously committed) work. Since by default even unpacked objects sit around for 2 weeks, you could potentially recover files that were only staged in the last 2 weeks but never committed.
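
As a rough illustration of that last point, here is a sketch of recovering content that was staged but never committed, as long as it has not been pruned yet. The file name and content are invented for the demo; git fsck and git cat-file are the real commands doing the work.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

echo 'precious draft' > draft.txt
git add draft.txt                # the blob is stored even though nothing is committed
git rm -q --cached draft.txt     # unstaged again
rm draft.txt                     # and lost from the working tree

# git fsck reports unreachable objects that nothing else refers to as "dangling".
hash=$(git fsck --dangling 2>/dev/null |
       awk '$1 == "dangling" && $2 == "blob" {print $3; exit}')
recovered=$(git cat-file -p "$hash")
echo "$recovered"
```

Only the content comes back. The blob carries no file name, so in a real repository with many dangling blobs you would have to inspect each one to work out which file it was.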

This worked exactly! Before this, the largest folder was the objects folder within git. After the above, the size is down to about 45 KB!
@Tryer Unfortunately, probably not. Blobs are file contents without the filenames. Maybe you can deduce what the filenames were by viewing their contents. See the link in the "Side Side Note" for details.
@Tryer: if you know about Unix file systems (directories and files and inodes), Git's storage model will make a bit more sense. Inside the repository database there are four object types: tag (annotated tag), commit, tree, and blob. Ignoring tags as uninteresting, take a look at your current commit with git cat-file -p HEAD. Note that what you see is only the metadata, but there is exactly one tree line. Copy-paste the hash ID shown and run git cat-file -p <hash>. (The tree object is binary, but cat-file -p makes it readable.)
The tree object corresponds pretty well to a Linux directory: it has "tree entries" giving a file name, file type (mode), and blob hash ID. Instead of storing the type/mode in an inode, Git stores them directly in the directory entry. The blob hash ID is then roughly equivalent to an inode number: git cat-file -p of a blob hash ID spills out the content. A tree entry can also be another tree, which provides the mappings for file names: path/to/file is a tree entry "path" with mode and hash leading to a tree that holds "to", which in turn leads to a tree that holds "file" (and its blob hash ID).
The curious thing here is that Git first reads the tree into Git's index, which is essentially a flattened set of trees minus the directories, with the path names strung together, so that we have only path/to/file and the mode and other cache data. It's the index's lack of support for "directories" that makes Git unable to store an empty directory: Git builds the next commit from the index, and the index never has a directory entry; Git synthesizes a new set of trees based on the stored path names in the index.
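
The walk from commit to tree to blob described in these comments, and the flattened view held by the index, can be tried out in a scratch repository. This is a sketch with made-up names (path/to/file, empty_dir, the demo identity); the Git commands themselves are standard.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

mkdir -p path/to empty_dir
echo 'content' > path/to/file
git add .                                  # empty_dir contributes nothing to the index
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'demo'

git cat-file -p HEAD                       # commit metadata with a single "tree" line
root=$(git rev-parse 'HEAD^{tree}')        # hash ID of the top-level tree
git cat-file -p "$root"                    # one entry: a tree named "path"
git cat-file -p HEAD:path                  # subtree holding "to"
git cat-file -p HEAD:path/to               # subtree holding "file" and its blob hash ID

blob=$(git rev-parse HEAD:path/to/file)
content=$(git cat-file -p "$blob")         # a blob is just the file's bytes

# The index is flat: full path names only, no directory entries.
index_paths=$(git ls-files -s | awk '{print $4}')
echo "$index_paths"
```

Note that empty_dir shows up neither in the index nor in any tree, which is the "Git cannot store an empty directory" behaviour mentioned above.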
