
I've read multiple answers that advise using either filter-branch or BFG to accomplish this task, but I feel I need further advice because my situation is a bit peculiar.

I have to manage two repositories; one is basically a clone of the other, and ideally I'd want to pull the changes from the origin into the clone on a daily basis. However, the origin repo contains very big files in its history, above GitHub's size limits. So I have to remove these files, but at the same time I don't want to harm the existing commit history beyond the changes to those specific files. From what I understand, BFG performs a complete rewrite of the history, which will fool GitHub into thinking that all existing files were deleted and recreated as new files, whereas filter-branch doesn't do that, but it's also extremely slow by comparison, and my repository is very big, reaching about 100,000 commits...

So I'm trying to figure out the best way to go about this. Should I use BFG at certain points and simply accept that I'm going to see ridiculous pull requests as a result of its modifications, or should I use filter-branch in some manner? To clarify, only 3 files are the cause of this grievance.

4 Comments
  • There is no way around rewriting the git history if you want to remove these files. Git has a few other tools to handle large repos, and you might be able to use some of them. You can clone just part of a repo with git clone --depth. You can also use git clone --reference to clone using a local repo to speed up the transfers. Note that --reference has some very special edge cases, so read through it carefully before using it to decide if it is what you want (a short sketch of both options follows these comments). Commented Nov 25, 2019 at 10:32
  • Yes, I've read about git clone --depth and git pull --depth before but failed to fully comprehend them. If I were to pull only these shallow changes, would that help me avoid pulling those large historical files that don't exist anymore? Commented Nov 25, 2019 at 11:24
  • Yes. The --depth flag is used to reduce the load on the server, so it will not download anything that is not needed for the commits of the given depth. If your large files are further back in history than the depth, they will not be downloaded. Commented Nov 25, 2019 at 11:34
  • Sounds good. I'll try doing that, thanks. Commented Nov 25, 2019 at 11:47
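For reference, here is a minimal sketch of the shallow-clone approach discussed in these comments; the depth value, URL, and local path are placeholders, not values from the question:

# Clone only the most recent history; big files that exist only in older
# commits are never downloaded (a depth of 50 is an arbitrary placeholder).
git clone --depth 50 https://github.com/example/origin-repo.git

# Subsequent pulls can stay shallow too.
git pull --depth 50

# Alternatively, reuse objects from an existing local repository to speed
# up the transfer; check the caveats in git-clone(1) before relying on it.
git clone --reference /path/to/local/repo https://github.com/example/origin-repo.git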

1 Answer


Commit history in Git is nothing but commits.

No commit can ever be changed. So for anything to remove a big file from some existing commit, that thing—whether it's BFG, or git filter-branch, or git filter-repo, or whatever—is going to have to extract a "bad" commit, make some changes (e.g., remove the big file), and make a new and improved substitute commit.

The terrible part of this is that each subsequent commit encodes, in an unchangeable way, the raw hash ID of the bad commit. The immediate children of the bad commit encode it as their parent hash. So you—or the tool—must copy those commits to new-and-improved ones. What's improved about them is that they lack the big file and refer back to the replacement they just made for the initial bad commit.
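You can see this parent linkage directly in any repository; a quick way to look at it:

# Print the raw form of the current commit: the "parent" line records the
# previous commit's hash ID verbatim, which is why changing any commit
# forces every descendant to be copied as well.
git cat-file -p HEAD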

Of course, their children encode their hash IDs as parent hash IDs, so now the tool must copy those commits. This repeats all the way up to the last commit in each branch, as identified by the branch name:

...--o--o--x--o--o--o   [old, bad version of branch]
          \
           ●--●--●--●   <-- branch

where x is the bad commit: x had to be copied to the first new-and-improved commit (the first ●), but then all subsequent commits had to be copied too.

The copies, being different commits, have different hash IDs. Every clone must now abandon the "bad" commits—the x one and all its descendants—in favor of the new-and-improved ones.
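For an existing clone, picking up the rewritten history typically looks something like the following sketch; the remote name origin and branch name master are assumptions:

# After the rewritten branch has been pushed, an existing clone can drop
# the old commits and adopt the new ones.
git fetch origin
git checkout master
git reset --hard origin/master   # local work on top of the old commits is discarded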

All these repository-editing tools should strive to make minimal changes. The BFG is probably the fastest and most convenient to use, but git filter-branch can be told to copy only the bad commit and its descendants, and to use --index-filter, which is its fastest (still slow!) filter. To do this, use:

git filter-branch --index-filter <command> -- <hash>..branch1 <hash>..branch2 ... 

where the <command> is an appropriate "git rm --cached --ignore-unmatch" command (be sure to quote the whole thing), and the <hash> and branch names specify which commits to copy. Remember that the A..B syntax means don't look at commit A or earlier, while looking at commits B and earlier, so if commit x is, say, deadbeefbadf00d..., you'll want to use the hash of its parent as the limiter:

git filter-branch --index-filter "..." -- deadbeefbadf00d^..master 

for instance (fill in the ... part with the right removal command).
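Putting that together, a complete invocation might look something like this sketch; big1.bin, big2.bin, and big3.bin are stand-ins for the three offending files, and deadbeefbadf00d for the bad commit shown above:

# Rewrite only the bad commit and its descendants on master, dropping the
# three large files from each copied commit.
git filter-branch --index-filter \
    "git rm --cached --ignore-unmatch big1.bin big2.bin big3.bin" \
    -- deadbeefbadf00d^..master

Once you are satisfied with the result, the rewritten branch has to be force-pushed, and the backup refs that filter-branch leaves under refs/original/ can be removed.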

(Note: I have not actually used The BFG, but if it re-copies commits unnecessarily, that's really bad, and I bet it does not.)


3 Comments

So if I understand this correctly, if these files existed as part of the master branch two months ago, that automatically means ALL commits from that point onward must be rewritten as new commits...? Which means that whether I use BFG or filter-branch, the amount of rewriting is going to end up the same?
Pretty much, yes. (Though note that "from that point" is specifically tied to ancestor/descendant relationships in the graph. For instance, if you have an old version that you've been patching, and you patched it, and none of the commits on that branch have the big files, it doesn't matter if the commit was yesterday: what matters is that the commit, and its parent, and its grandparent, and so on through time never had the big files.)
Ok, thanks for clearing this up for me... I think I'll try using the --depth option that MrBerta suggested, since I don't mind losing the history from the origin repository.
