
I am wondering what the side effects are, for a large distributed project, of using git filter-branch ... 'git rm ..' to purge data from repository history (see this link).

Our project is hosted on GitHub, where there are several important forks, each with many users, and I would like to minimize the impact.

Thanks!

EDIT: My goal is to remove sensitive data. While there is no "putting the cat back in the bag", it is important for us to remove this from the repo.

  • There is also the specialized BFG Repo-Cleaner, which is faster than git filter-branch and easier to use, but limited to this specific functionality: rtyley.github.io/bfg-repo-cleaner (Commented Jun 17, 2014 at 22:29)

1 Answer


You don't say why you want to purge data, but there are two main reasons.

The first (and most common in my experience) is that you've got some sensitive information to remove. In fact, that's what the page you linked to uses for its title.

On a large, widely distributed project I would strongly avoid this.

  • This invalidates commit IDs, causing huge headaches (see below) for every contributor.
  • The information that you are trying to protect is already out there and must be considered compromised.

The fact that this is a large repository with many forks and contributors just makes these points more important. Instead of rewriting the project's history, just remove the file(s) in all active branches, change the keys / passwords / whatever, and move on.
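
For illustration, that cleanup is just an ordinary commit on each active branch plus rotating the credentials themselves (the file name here is a stand-in):

    # On each active branch, remove the leaked file with a normal commit.
    git checkout master
    git rm config/secrets.yml
    git commit -m "Remove leaked credentials file"
    git push origin master

    # Then rotate the secrets themselves: change the passwords, revoke and
    # reissue the keys. That is what actually closes the hole.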

The other main reason to purge data is to reduce the size of your repository, e.g. to fit within GitHub's repository size limit. If this is your goal you may not have as many options, as there are only so many ways to do this.
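
If size really is the goal, the usual approach is to strip the offending blobs from history, for example with the BFG Repo-Cleaner mentioned in the question's comments, and then repack. A rough sketch, with a made-up repository URL and size threshold (and note that this still rewrites history, so everything below still applies):

    # Work on a fresh mirror clone so the original stays untouched.
    git clone --mirror https://github.com/example/project.git
    java -jar bfg.jar --strip-blobs-bigger-than 50M project.git

    # Expire old references and repack so the space is actually reclaimed.
    cd project.git
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive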

Edit about modifying shared history:

Here's the main problem with modifying shared history: A commit's hash is calculated based on a number of things, including file content, the timestamp of the commit, the commit message, and the parent commits' hashes.

That last one is important: if you modify one commit in any way, its hash changes. If one commit's hash changes, all of its descendant commits' hashes change.
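
You can see exactly what goes into the hash by printing the raw commit object; the parent line is why a rewrite cascades:

    # Pretty-print the commit object that the hash is computed over. It lists
    # the tree hash (file content), one "parent" line per parent commit, the
    # author and committer lines (names and timestamps), and the message.
    # Change any of these and the commit's hash changes.
    git cat-file -p HEAD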

So imagine your repository. There are lots and lots of copies of it all over the place. Each contributor in each fork has a copy, and each of those copies shares history. When you push commits to GitHub, they can be resolved against GitHub's existing commits because the graph formed by the commits is the same. It has the same structure, and all of the old commit hashes match up. And when I fetch from GitHub the same logic lets me get updated commits. Each forked copy works on the same principle.

Now imagine that you use filter-branch on the master branch of your repository to remove some file. The earliest commit that contains that file will be changed, and so will every one of its descendant commits. Now you try to update GitHub, but you can't, because the update can't be resolved against GitHub's commit network. The commit hashes no longer match. You have to push --force.
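
For concreteness, that sequence looks roughly like this (the file path is hypothetical):

    # Rewrite every commit on master so the leaked file never existed.
    git filter-branch --index-filter \
        'git rm --cached --ignore-unmatch config/secrets.yml' \
        --prune-empty -- master

    # A normal push is rejected as non-fast-forward, because the rewritten
    # history no longer matches what GitHub has.
    git push origin master           # rejected
    git push --force origin master   # replaces the shared history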

But after you push --force your changes, every other copy of the repository now contains commits that no longer match the upstream history. Every copy of every fork will fail on merge --ff-only. Every single copy of every single fork must be updated to match your new, different history. Probably manually.
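
To make "updated to match" concrete, the owner of each clone ends up running something like this, assuming they have no local work they need to keep:

    # Throw away the old master and adopt the rewritten one.
    git fetch origin
    git checkout master
    git reset --hard origin/master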

Which can, in theory, be accomplished. If there are only a few copies of the repository, it may even be manageable. But with a successful public repository, with many forks, how will you ever coordinate that effort? Every single user will have to jump through hoops, manually merging origin/master, which has now dramatically diverged from master, just to accommodate your changes. Have you ever tried to merge two branches that diverged a long time ago? Not fun.

And if those users have work in progress, branches off of master that aren't merged in yet, they'll have to rebase that work onto the rewritten history. And test it again. And make sure to coordinate those changes with everybody else dealing with your changes.
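
That recovery looks something like the following, with a hypothetical in-flight branch called feature-x that was based on the old master:

    # Replay the commits that are on feature-x but not on the old master
    # onto the rewritten origin/master.
    git fetch origin
    git rebase --onto origin/master master feature-x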

Basically, all work on all branches in all forks will have to be frozen until this process gets sorted out.

And even after all of that, your sensitive files have still been leaked. Someone may not delete their copies of those files, or may have already copied your passwords or keys. The old files may be cached on a server somewhere. The information in those files is still compromised, and none of the work you've done changes that in any way.

A much, much better option is to admit whatever mistake you've made, introduce one new commit to reverse it, leaving all ancestor commits as they are, and move forward.

And the only way to deal with your leaked information is to change it.


2 Comments

Can you please go into more detail regarding the 'headaches'?
@mikewaters, I've added some detail to my answer. I hope it helps. If anything isn't clear, please let me know.
