How do I remove a folder from Git’s history?

18

A while ago I started developing a website. It is well organized, with a folder for everything. For example, from the root of the repository/project:

account/  
products/  
js/  
css/  
img/   
data/ *(PDFs and zips for download)*  
index.php   
etc...  

The problem is that at first I had no idea how much trouble binary files cause in Git, and all the downloadable files on the site (PDFs and zips) were added to the data folder over several commits.

Right now the repository is 600 MB, and I know that if the data folder had never been added it would be less than 10 MB!

Is there any way to permanently erase a folder, or files by type (pdf, zip), from the entire Git history?

  • related: https://answall.com/questions/485278/como-remover-um-arquivo-do-git-mas-o-manter-locally

3 answers

17

The answers are all correct, but what is actually happening?

Git commands are not always very friendly, so here is a more human explanation.

Based on the script from the link provided by @Guilherme:

git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD

git filter-branch --index-filter "[command]" HEAD

Here we go through the history revision by revision, telling git to run our command on each one (in this case, the command that deletes the files).

  • git filter-branch: Rewrites history, revision by revision, according to the filters you specify.
  • --index-filter: Tells git to run the command directly against the index of each revision, without checking the files out to disk. To have git check out the files for each revision, use --tree-filter instead. The advantage of --index-filter is that it runs faster; the downside is that only commands that work on the index (in practice, git commands) can be used.
  • "[command]": The command git will run for each revision.
  • HEAD: Indicates which revisions git should iterate over. It can be another specifier, such as a SHA-1 or a tag. Run git rev-list with the same argument to see which revisions will be visited (this is the command filter-branch uses internally).
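
To try this without touching a real project, here is a minimal sketch that builds a throwaway repository and runs the same command on it. The file names (index.php, data/manual.pdf, data/files.zip) and the Demo identity are made up for illustration; FILTER_BRANCH_SQUELCH_WARNING only silences the warning that newer git versions print before filter-branch starts.

```shell
# Work in a throwaway repository so nothing real gets rewritten.
export FILTER_BRANCH_SQUELCH_WARNING=1   # skips the warning pause in newer git versions
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

# Two commits that each put a fake download under data/.
mkdir data
echo '<?php' > index.php
echo 'fake pdf' > data/manual.pdf
git add . && git commit -q -m "site and first download"
echo 'fake zip' > data/files.zip
git add . && git commit -q -m "second download"

# These are the revisions filter-branch will visit for HEAD.
git rev-list HEAD

# $files sits inside double quotes, so the shell expands it right here,
# before filter-branch starts; it has to be set beforehand.
files="data"
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD

# Collect every path that still exists in any rewritten revision.
remaining=$(git rev-list HEAD | while read -r rev; do
  git ls-tree -r --name-only "$rev"
done | sort -u)
echo "$remaining"
```

In this sketch only index.php should be left in the rewritten revisions. Note that if $files is empty when the command is typed, git rm receives no path at all, so it is worth checking the variable first.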

git rm -rf --cached --ignore-unmatch [files]

This is the command that actually deletes the files, executed on each revision visited by filter-branch.

  • git rm: Removes files from the index and the working copy (in our case, only the index).
  • -rf: -r for recursive, removes directories recursively. -f for force, removes the file even if it has local modifications; it makes no difference in our case, but it does no harm either :-)
  • --cached: Makes the command operate only on the index (the staging area).
  • --ignore-unmatch: Returns no error (exit status 0) even if no files match. This matters because if the command run by filter-branch exits with a non-zero status, git treats it as an error and aborts.
  • [files]: Paths of the files to be removed.
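
The effect of --cached and --ignore-unmatch can be checked on their own, outside filter-branch. A small sketch, again in a throwaway repository with made-up names (data/manual.pdf, missing-dir):

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"
mkdir data
echo 'fake pdf' > data/manual.pdf
git add . && git commit -q -m "add pdf"

# --cached: the path leaves the index, but the file stays on disk.
git rm -r -q --cached data
[ -f data/manual.pdf ] && on_disk=yes || on_disk=no
in_index=$(git ls-files data)

# A path that matches nothing is an error without --ignore-unmatch...
git rm -r --cached missing-dir 2>/dev/null
without_flag=$?
# ...and exit status 0 with it.
git rm -r --cached --ignore-unmatch missing-dir
with_flag=$?
echo "on disk: $on_disk, without flag: $without_flag, with flag: $with_flag"
```

The non-zero status in the first case is exactly what would make filter-branch stop at the first revision that does not contain the files yet.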

Done. After running the command, the files no longer appear in any revision of the git history (git log). So has the repository shrunk considerably? Not exactly; at least not in your local copy.

When it rewrote the revisions to delete the files, git generated new revisions without them and swapped the old ones for the new ones. But git does not delete the old revisions from the repository; it just dereferences them (if that is even a word :p). They continue to exist in the repository as orphans. You can see this by listing them with git reflog, or by checking the size of the .git folder (which in your case will still take up a lot of space).
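
A quick way to see that the old revisions are still stored is to remember the old commit id before the rewrite and look for it afterwards. This is only a sketch on a throwaway repository with made-up file names:

```shell
export FILTER_BRANCH_SQUELCH_WARNING=1   # skips the warning pause in newer git versions
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"
mkdir data
echo '<?php' > index.php
echo 'fake pdf' > data/manual.pdf
git add . && git commit -q -m "site and download"

old_head=$(git rev-parse HEAD)
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch data" HEAD
new_head=$(git rev-parse HEAD)

# filter-branch keeps a backup of every ref it rewrote under refs/original/.
backup_refs=$(git for-each-ref --format='%(refname)' refs/original/)
echo "$backup_refs"

# The old commit object is still in the object database.
if git cat-file -e "$old_head" 2>/dev/null; then old_still_stored=yes; else old_still_stored=no; fi
echo "old commit still stored: $old_still_stored"

# The reflog also remembers the old commit, and the objects still take up space.
git reflog
git count-objects -v
```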

To completely remove these revisions from git, you need the last (and often forgotten) line of the script we are using as an example:

rm -rf .git/refs/original/ && git reflog expire --expire=now --all && git gc --aggressive --prune=now

Let's go through it.

rm -rf .git/refs/original/

This erases the backup made by filter-branch.

  • rm -rf: A plain shell command. Again, -r for recursive, to delete subdirectories; -f for force, ignores non-existent files and never prompts the user.
  • .git/refs/original/: This is the folder with the backup of the references affected by the filter-branch command.

git reflog expire --expire=now --all

This dereferences the orphaned revisions once and for all.

The reflog is, in my opinion, one of the most poorly documented (and confusing) commands in all of git.

  • git reflog: Similar to git log (the name is a contraction of "reference log"), but it also covers orphaned revisions, stashes, and so on.
  • expire: Expires revisions (removes their entries from the reflog, so that git no longer knows about them).
  • --expire=now: Expires entries older than the given date; now applies it to all of them regardless of age.
  • --all: Makes the command cover the reflogs of all references, including other branches and stashes.

git gc --aggressive --prune=now

And finally, this erases them.

  • git gc: git's garbage collector. Removes unnecessary files and compresses git's objects.
  • --aggressive: Makes git optimize more thoroughly, even if the command takes longer to run.
  • --prune=now: Prunes loose objects older than the given date; now applies it to all of them regardless of age.
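
Putting the three cleanup steps together, the sketch below checks that the old commit really disappears from the object database. It assumes a fresh throwaway repository with made-up file names, where nothing else (another branch, a tag, a stash) still points at the old history:

```shell
export FILTER_BRANCH_SQUELCH_WARNING=1   # skips the warning pause in newer git versions
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"
mkdir data
echo '<?php' > index.php
echo 'fake pdf' > data/manual.pdf
git add . && git commit -q -m "site and download"

old_head=$(git rev-parse HEAD)
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch data" HEAD

# 1. drop the backup refs, 2. empty the reflogs, 3. collect the garbage.
rm -rf .git/refs/original/ && git reflog expire --expire=now --all && git gc --aggressive --prune=now

# Nothing references the old commit any more, so gc has pruned it.
if git cat-file -e "$old_head" 2>/dev/null; then old_gone=no; else old_gone=yes; fi
echo "old commit gone: $old_gone"
git count-objects -v
```

In a real repository, any branch or tag that was not rewritten keeps the old objects alive, so the size only drops once every reference to them is gone.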

Remember, you rewrote your history

To apply your local history, now modified, to your remote you will have to force your push:

git push -f

Consider the effects of this rewrite for those who are also working on the same remote as you.
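
To see why the push has to be forced, here is a sketch that uses a local bare repository as a stand-in for the shared remote. The paths, the file names and the origin setup are all assumptions made for the demonstration:

```shell
export FILTER_BRANCH_SQUELCH_WARNING=1   # skips the warning pause in newer git versions
base=$(mktemp -d)
git init -q --bare "$base/remote.git"    # stands in for the shared remote
git init -q "$base/work"
cd "$base/work" || exit 1
git config user.name "Demo"
git config user.email "demo@example.com"
git remote add origin "$base/remote.git"

mkdir data
echo '<?php' > index.php
echo 'fake pdf' > data/manual.pdf
git add . && git commit -q -m "site and download"
branch=$(git symbolic-ref --short HEAD)
git push -q origin "$branch"

git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch data" HEAD

# A normal push is rejected: the rewritten branch no longer descends
# from the commit the remote has.
git push -q origin "$branch" 2>/dev/null
plain_push=$?

# Forcing replaces the remote branch with the rewritten history.
git push -q -f origin "$branch"
remote_head=$(git --git-dir="$base/remote.git" rev-parse "refs/heads/$branch")
local_head=$(git rev-parse HEAD)
echo "plain push exit status: $plain_push"
```

Anyone who cloned before the forced push still has the old history locally, which is why the warning above matters.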


P.S.

  • The last line of the script that cleans up the repository history is my own adaptation. The difference is the addition of the --expire=now and --prune=now parameters. Without them, git uses default periods of 90 days and 2 weeks respectively, so the cleanup only affects revisions older than that.
  • GitHub also has a very similar tutorial.
  • It worked perfectly for me. In short (because the answer is quite long), run: git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD and then: rm -rf .git/refs/original/ && git reflog expire --expire=now --all && git gc --aggressive --prune=now

  • EXCELLENT RESPONSE

  • Great explanation. I'm still afraid to run it because there are more people working on the same project.

17


It needs testing, but this link mentions the following command to permanently remove files from the history:

git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch file_name" HEAD

  • After answering I saw this... the problem is more complex. I found this here. But you'd have to test it to be sure.

  • filter-branch is indeed the way to do it. I would add --prune-empty, which removes commits that become "empty" (because the files they changed are no longer part of the repository).

  • By the way, remove your original answer and leave only the update. :)

  • I ran the test and the repository was indeed reduced, to 60% of its original size. That was not the expected value, but it is an improvement anyway...

3

Adapting the command given in Chapter 9.7 of the book Pro Git:

$ git filter-branch --index-filter 'git rm -r --cached --ignore-unmatch file-or-directory-name'

Note that since the command rewrites the history, you will need to download (clone) the repository again on all computers that already had a copy of the repository.

Explanation from the book:

The option --index-filter is similar to the option --tree-filter used in Chapter 6, except that instead of passing a command that modifies the files you have checked out on disk, you are modifying your staging area, or index.

Instead of removing a specific file with something like rm file, you have to remove it with git rm --cached; you must remove it from the index, not from the disk. The reason to do it this way is speed: because Git doesn’t need to check out every revision on disk before running its filter, the process can be much faster.

The option --ignore-unmatch of git rm tells it not to show errors if the pattern you are trying to remove is not there.

Your history no longer contains a reference to the file. However, your reflog and a new set of refs that Git added under .git/refs/original when you ran filter-branch still do, so you have to remove them and then repack the database. You need to get rid of anything that has a pointer to those old commits before you repack:

$ rm -Rf .git/refs/original

$ rm -Rf .git/logs/

$ git gc

  • 1

    For directories, use 'git rm -r ...'. Some useful additional flags are --prune-empty, to remove any commits left empty (if desired); --tag-name-filter cat, to preserve any tags that exist in the repository; and a trailing -- --all, to apply the rewrite to all refs.
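
A sketch combining the flags from this comment, under a few assumptions: a throwaway repository, made-up file names, a lightweight tag v1, and quoted '*.pdf' '*.zip' patterns (git matches them as pathspecs, in subdirectories too) to remove by file type instead of by folder.

```shell
export FILTER_BRANCH_SQUELCH_WARNING=1   # skips the warning pause in newer git versions
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"
mkdir data
echo '<?php' > index.php
echo 'fake pdf' > data/manual.pdf
git add . && git commit -q -m "site and first download"
git tag v1
echo 'fake zip' > data/files.zip
git add . && git commit -q -m "second download only"

# --prune-empty drops commits that end up changing nothing,
# --tag-name-filter cat keeps tags under the same name on the rewritten commits,
# -- --all rewrites every ref instead of only HEAD.
git filter-branch --prune-empty --tag-name-filter cat \
  --index-filter "git rm -r --cached --ignore-unmatch -- '*.pdf' '*.zip'" \
  -- --all

commit_count=$(git rev-list --count HEAD)
tag_paths=$(git ls-tree -r --name-only v1)
echo "commits left: $commit_count, paths in v1: $tag_paths"
```

Here the second commit only touched a zip, so it should vanish entirely, while v1 keeps pointing at a commit that now contains only index.php.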
