Git push new branch with same files, uploads all files again

Question

Given following scenario.

Create new branch
Commit a 10MB file
Git push (uploads the 10MB file)
Create a new branch (orphan)
Commit the same 10mb file (no changes made, same object sha hash)
Git push uploads the 10MB file AGAIN

My expectations are, that the already uploaded files won't be uploaded again using git push. But what actual happens is that when a new branch is made all files (even when thousands of smaller source files, instead of one 10MB file) will be uploaded again and again.

My question: How can I make it that Git detects that the 10mb file is already uploaded? Do you know a workaround/fix to make Git detecting already existing objects on the server when pushing commits? Git detects files by its sha, so it should be able to detect that some files in the tree of the commit are already present on the server.

Possible use-case: I have two completely different branches, but some common files are shared within those two. When I push one branch, I don't want to upload the common files again when I push the second branch.

Actual use-case: I do a lot of machine learning experiments using Python scripts and some smaller datasets (1MB - 10MB). Every time I start an experiment, I add all necessary experiment files to a new Git tree, and use that tree in a new commit without branching. That commits hangs completely free in the air and gets then referenced with a new Git reference (e.g. refs/jobs/my-experiment-name). When I now have two experiments with almost the same files (and thus two references), Git pushes all objects again when I push those references. I have low bandwidth and this really slows down my work.

$ mkdir git-test && cd git-test $ git init $ git remote add origin [email protected]:username/projectname.git # create dummy 10MB file $ head -c 10000000 /dev/urandom > dummy $ git add dummy $ git commit -m 'init' # first push, uploads everything - makes sense $ git push origin master Counting objects: 3, done. Delta compression using up to 6 threads. Compressing objects: 100% (2/2), done. Writing objects: 100% (3/3), 9.54 MiB | 1.13 MiB/s, done. Total 3 (delta 0), reused 0 (delta 0) # create new empty branch, not based from master $ git checkout --orphan branch2 # add same files again $ git add dummy $ git commit -m 'init on branch2' # this uploads now again the dummy file (10MB), although the server # has that object alread $ git push origin branch3 Counting objects: 3, done. Delta compression using up to 6 threads. Compressing objects: 100% (2/2), done. Writing objects: 100% (3/3), 9.54 MiB | 838.00 KiB/s, done.

On the technical side we have:

Two commits that do not share the same parents (have completely different history)
Those two commits have the exact same tree sha id (and thus reference the same object files)
Pushing both commits results in transferring all the objects in the same tree twice. Although I expect either that Git detects that the tree in the second commit is already present OR that file objects within that tree are already on the server.

Answer (I can't answer anymore, since someone marked this as duplicate).

The solution is unfortunately not that simple.

Every time Git wants to sync two repositories it builds a pack file, that contains all objects necessary (like files, commits, trees). When you execute a git push, the remote sends all existing references (branches) and its head commit SHA to the client. This is the problem: The pack protocol is not meant to be used per-object, but per-commit. So, according to the protocol itself, the explained behaviour above is correct. To work around that, I built a simple script every one can use to do a git push based on objects, instead of commits.

You find it here: https://github.com/marcj/git-objects-sync

What it does:

Takes one commit (only one, you need to execute it on every unsynced parent commit as well) and builds a list of object SHAs (files, trees, commits) that belong to that commit (except parent commit).
Sends this list to the server, servers answers back SHAs of objects it does not have yet
Client builds a pack file based on the missing object SHAs and sends it to the server with the information which ref needs to be updated to which commit.
Server receives pack file, unpacks it and updates the ref with given commit SHA.

Of course this has some drawbacks, but I described them in the linked Github repository.

With my script above you get now following:

marc@osx ~/git-test (branch11*) $ # added new branch11 as explained at the very top marc@osx ~/git-test (branch11*) $ python git-sync.py refs/heads/branch11 Counting objects: 1, done. Writing objects: 100% (1/1), 158 bytes | 158.00 KiB/s, done. Total 1 (delta 0), reused 0 (delta 0) marc@osx ~/git-test (branch11*) $ git push origin branch11 Everything up-to-date

So as you see, it only syncs one object (the commit object), and not the dummy file and its tree object again.

I think the real question here is why are you committing the same thing to multiple branches? I can't wrap my head around why you would do that — Joe Phillips
– Joe Phillips, Commented Jan 12, 2018 at 14:40
Also your explanation of your process doesn't make sense. What are you branching from in step 4? If it's off of the head of the branch you previously pushed, what you're saying should be impossible. Please be more clear in your example — Joe Phillips
– Joe Phillips, Commented Jan 12, 2018 at 14:41
@JoePhillips he is not branching --orphan, no no wonder this file is not in git :D The question is why does he need this flag — Rumid
– Rumid, Commented Jan 12, 2018 at 14:42
Possible use-case: I have two completely different branches, but some common files are shared within those two. When I push one branch, I don't want to upload the common files again when I push the second branch. Also, Git detects files by its sha, so it should be able to detect that some files in the tree of the commit are already present on the server. — MMM
– MMM, Commented Jan 12, 2018 at 14:43
This is a dup of Could Git Smart API thin pack calculation ever consider reusing common sub-trees?: you've got an extremely low-frequency corner case and avoiding the downside here would mean paying measurable, noticeable overhead every. single. fetch. It's a bad trade. — jthill
– jthill, Commented Jan 14, 2018 at 19:27

LightBender · Accepted Answer · 2018-01-12 15:22:01Z

1

Git develops its manifests and packs up objects using only an exchange of references. If the branch you are pushing has no common ancestors with the remote, it's going to pack and re-upload all the objects reachable from that branch.

The sequence goes something like this

You: send list of your current remote refs to remote
Remote: send all objects not reachable from that list to client and a list of new refs
You: Add objects to your local database and update remote refs
You: find all commits on your branch not reachable from any remote ref (No common ancestor, all commits on branch will be sent)
You: build a manifest out of the diffs for those commits (finds 10mb file)
You: pack and send to remote

The remote will recognize objects already in its database and use the existing objects.

The logic is that sending a few unnecessary files from time to time is far more efficient than walking the entire history (potentially tens or hundreds of thousands of objects being accessed and compared) on every push.

answered Jan 12, 2018 at 15:22

LightBender

4,3431 gold badge18 silver badges34 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

torek Over a year ago

Note that Git could use have/want exchanges on all objects, rather than just on commits. In this case it would find that the large files already exist (by hash ID), but that would require a lot more data-exchange in the "slow part" of the fetch or push transaction.

MMM Over a year ago

@torekt: That's not true. It's always less data exchange than in the current version. Current version: uploads all sha ids AND its content of all files. Different version: Uploads all sha-ids and returns back which are not on the server already, then sends the content of missing objects only.I don't see how sending all local git sha ids only would require more data-exchange than all sha ids + its actual content.

LightBender Over a year ago

@MarcJ.Schmidt Keep in mind that your use case is not what git is designed to do. Git is optimized for branching, merging, and tracking a single code base over a long period of time. Under normal circumstances, yes, it would require transmitting fewer actual bytes, but the few hundred milliseconds it takes to connect, wait for the server to filter and respond, and build a second manifest would be an order of magnitude longer than just pushing all the data. You have created an edge case where this is not true, but fixing it would degrade the performance of the primary use case.

MMM Over a year ago

@LightBender I disagree. Git is not only optimized for what you said, but also for efficient file storage. The very heart of Git. I want to utilize this and its internal model with commits, tree, file objects, references is perfectly fine for my use-case. The only thing where it isn't perfect when it comes to pushing changes, since Git uses only the direct parent commits history to build the delta pack.

eddiemoya Over a year ago

@MarcJ.Schmidt You keep insisting that git is optimized for what you want to do, but it's clearly stated purpose isn't file management, and everyone is trying to explain to you that while it maybe good at some aspects of what you want - its not optimized for your use case. Specifically with regard to the file transfers. There have been multiple explanations of ways to squeeze parts of what you want out of git - but your not going to get a perfect solution out of the box. Your going to need to sacrifice workflow or project structure, or something else. Or write your own git plugin to do this.

eddiemoya · Accepted Answer · 2018-01-17 17:35:15Z

I think you just need to stop using --orphan to create new experiment branches.

Workflow

Create your initial project.
Add and commit your core/common files to the master branch
Create all the non-orphaned branches you want for every experiment. Create them based on the master branch.

That's it.

What's going on?

You have insisted that you aren't using branches and that you are only using references. However, branches are a kind of reference. Moreover, git checkout --orphan <newthing> does actually create a branch. The trouble is that its a branch that doesn't know about anything that was previously added to the repository because it has no parents. It's essentially the same thing as having created a whole new repository.

If you create new branches with git checkout -b <newthing> master, then git will not bother uploaded files that were already in master.

How do you manage new common files now?

Let's say someday you have a new file which you want all future experiments to make use - of a new shared/common file. All you would need to do is add that file to master and create your next experiment branch based on the updated master branch. If want that file to be available to your existing/previously created experiments, you would just need to checkout those branches and run git pull --rebase origin master. This would pull in the commits you added to master, which would contain the newly added file(s).

Mounting Complexity

When you start doing pulls, things might start getting complicated. There are a couple different strategies for how to update branches, and using --rebase is one of those strategies. It's not required, but it's probably the better way to go. There additional things to consider such as how to manage conflicting changes, but those are seemingly outside the scope of this question. There are plenty of resources available to explain rebasing/merging etc.

TR;DR

Don't try to manage commit-trees and parent/child relationships manually. Just let git do its thing.

FelipeC · Accepted Answer · 2018-01-18 02:04:45Z

As others have mentioned; Git will only check for blobs on the branch you are pushing, but you can trick Git to check for blobs in the master branch by including it in your ancestry.

It seems you really want to work on an orphan branch, so you can merge the master branch only when you want to push. You can ignore the whole contents of the master branch by using the ours strategy.

 % git checkout --orphan branch2 % git rm -rf . % git checkout master dummy % git commit -m 'Init on branch2' % git merge --strategy=ours --allow-unrelated-histories master -m 'Fake merge' % git push origin branch2 Counting objects: 3, done. Delta compression using up to 4 threads. Compressing objects: 100% (2/2), done. Writing objects: 100% (3/3), 336 bytes | 336.00 KiB/s, done. Total 3 (delta 1), reused 0 (delta 0) remote: Resolving deltas: 100% (1/1), done. To github.com:felipec/bounty-test.git * [new branch] branch2 -> branch2

The contents are exactly the same:

 % git rev-parse branch2: d0f549d94dbba116d782293722cf9b43e8a67819 % git rev-parse branch2^: d0f549d94dbba116d782293722cf9b43e8a67819

If you don't want to mess with your original branch, you can create a new branch just for pushing.

Also, you can just start off based of master, and just throw away all the files:

 % git checkout -b branch3 master % git rm -rf . % git checkout master dummy % git commit -m 'Init on branch3' % git push origin branch3 Counting objects: 2, done. Delta compression using up to 4 threads. Compressing objects: 100% (1/1), done. Writing objects: 100% (2/2), 236 bytes | 236.00 KiB/s, done. Total 2 (delta 0), reused 0 (delta 0) To github.com:felipec/bounty-test.git * [new branch] branch3 -> branch3

If you really want this rather specific use-case to be handled properly in Git you might want to contact the developers on the mailing list. They might offer you other alternatives, but it's possible that they won't agree there's something that can be improved in the code without significant trade offs for other cases.

Note: I don't know why you had to add the dummy file in your steps, on my side the whole contents of the master branch are staged when I do git checkout --orphan.

Gonzalo Matheu · Accepted Answer · 2018-01-14 22:04:49Z

It might be better to use Git LFS in your scenario. Git LFS is made for managing large files on git repositories.

As per this github issue (which has been resolved ~2 years ago), duplicate files are managed as a single entity with the same OID. Thus they are pushes only once and fetched just once.

At least Github and Gitlab already include support for git-lfs in all the repositories.

Collectives™ on Stack Overflow

Git push new branch with same files, uploads all files again

Answer (I can't answer anymore, since someone marked this as duplicate).

4 Answers 4

5 Comments

Comments

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

Answer (I can't answer anymore, since someone marked this as duplicate).

4 Answers 4

5 Comments

Comments

Comments

Comments

Linked

Related