I've got a test repository that I put under Git. Most of the files are pretty tiny, but there are a very large number of them, and simple Git operations like add and status take tens of minutes to complete. What are my options for putting these files under revision control while getting reasonable performance? Should I attempt to use submodules, or should I steer clear of DVCSes altogether?
2 Answers
Git operations like add and status require stat()ing every file in the working tree (to detect changes). Either you have a truly massive number of files (say, tens or hundreds of thousands of them), or you have a filesystem with a rather slow stat operation.
In any case, if you need to work on a system where stat is extremely slow, you can set the "assume unchanged" bit in the index, which tells Git not to bother stat()ing the file. If you turn this on, you must manually instruct Git to pick up changes in individual files, e.g. by passing them directly to git add; otherwise Git won't even know anything changed. You can enable it by running git config core.ignoreStat true and then refreshing the index with something like git reset --hard HEAD.
1 Comment
I wonder what "very large" means here. Usually it's not a large number of small files that Git finds troublesome but big binary files. However, I can imagine that if the count is large enough, you'd want to split them into several repositories – either by means of submodules or some other way. If they need to reside in one single repo, you might find, for example, Subversion to be more performant.
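A sketch of the submodule approach, with local throwaway repos standing in for wherever the split-out component repositories would actually be hosted (the names and paths are placeholders; protocol.file.allow is only needed because this demo uses file-based submodules):

```shell
# Hypothetical split: carve the tree into per-component repositories and
# reference them from a thin superproject via submodules.
base=$(mktemp -d) && cd "$base"

# A stand-in component repo with a single empty commit
git init -q component-a
( cd component-a
  git -c user.email=demo@example.com -c user.name=demo \
      commit -qm "seed" --allow-empty )

# The superproject records each component at a pinned commit
git init -q super && cd super
git -c protocol.file.allow=always \
    submodule add -q "$base/component-a" component-a
git -c user.email=demo@example.com -c user.name=demo \
    commit -qm "Track component-a as a submodule"

git submodule status        # lists component-a at its pinned commit
```

Day-to-day Git operations then stat only the files of whichever component you're working in, not the whole tree.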
EDIT: OK, so you added in a comment that you use an NFS mount, which sounds like the likely bottleneck here. Please check for solutions on that in this thread. core.preloadindex in particular might be of interest here.
From the documentation:
core.preloadindex
Enable parallel index preload for operations like git diff
This can speed up operations like git diff and git status especially on filesystems like NFS that have weak caching semantics and thus relatively high IO latencies. With this set to true, git will do the index comparison to the filesystem data in parallel, allowing overlapping IO’s.
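Enabling it is a one-line config change; a sketch in a throwaway repo (on a real NFS checkout you'd run the same commands in place):

```shell
# Demo repo so the commands below have somewhere to run
cd "$(mktemp -d)" && git init -q .

# Turn on parallel index preload (add --global to apply to every repo)
git config core.preloadindex true
git config core.preloadindex     # prints: true

# A rough way to gauge the effect on a real repo is to time git status
# before and after flipping the setting:
time git status
```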
EDIT2: in the comments there was a mention of 6 million files. I can understand this becoming a bottleneck – that is indeed a very large number of files.
This also helped a git reset operation on a local SSD file system. The repo is rather large – the CocoaPods local spec repo.