Introduction
Over time, Git repositories can accumulate unnecessary files, large binaries, sensitive data, and messy commit histories. This guide provides comprehensive techniques for cleaning up repositories, removing sensitive information, and optimizing repository performance.
BFG Repo-Cleaner Method
BFG Repo-Cleaner is a faster, simpler alternative to git-filter-branch for cleansing bad data from Git repository history.
Installation
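# Download the BFG jar, saving it as bfg.jar so the commands below work as written
wget -O bfg.jar https://repo1.maven.org/maven2/com/madgag/bfg/1.14.0/bfg-1.14.0.jar

# Or using Homebrew on macOS (this installs a `bfg` command; use it in place of `java -jar bfg.jar`)
brew install bfg

# Or using a package manager on Linux, where a bfg package is available for your distribution
sudo apt-get install bfg   # Debian/Ubuntu
sudo yum install bfg       # RHEL/CentOS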
Removing Files by Name
# Clone a fresh copy of your repo using --mirror
git clone --mirror git://example.com/my-repo.git

# Remove all files named 'id_rsa' or 'id_dsa'
# (the pattern is quoted so that BFG, not the shell, expands the braces)
java -jar bfg.jar --delete-files 'id_{dsa,rsa}' my-repo.git

# Remove all files with .log extension
java -jar bfg.jar --delete-files '*.log' my-repo.git

# Clean up the reflog and garbage collect
cd my-repo.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive

# Push the cleaned repository
git push
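Before pushing, it can help to confirm that the file names really are gone from every ref. The following is a minimal sketch, not part of BFG itself: `history_has_file` is a hypothetical helper name, and it only compares the final path component, so it says nothing about file contents.

```shell
#!/bin/sh
# history_has_file NAME (hypothetical helper)
# Exit 0 if any object reachable from any ref has a path whose final
# component is exactly NAME; exit 1 otherwise.
history_has_file() {
    name=$1
    git rev-list --objects --all |
        awk -v n="$name" '
            NF >= 2 {
                # Everything after the object id is the path (may contain spaces)
                path = substr($0, index($0, " ") + 1)
                k = split(path, parts, "/")
                if (parts[k] == n) found = 1
            }
            END { exit (found ? 0 : 1) }'
}
```

Run it inside `my-repo.git`, for example `history_has_file id_rsa && echo "id_rsa is still reachable"`. Note that BFG leaves the most recent commit on HEAD untouched by default, so a file that is still present there will continue to show up until it is deleted in a normal commit.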
Removing Large Files
# Remove all blobs bigger than 100M
java -jar bfg.jar --strip-blobs-bigger-than 100M my-repo.git

# Remove the 20 largest files
java -jar bfg.jar --strip-biggest-blobs 20 my-repo.git

# Clean up
cd my-repo.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push
Removing Sensitive Data
# Create a file with sensitive strings to remove
echo 'PASSWORD=secretpass123' >> passwords.txt
echo 'API_KEY=abcd1234efgh5678' >> passwords.txt
echo 'AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY' >> passwords.txt

# Replace sensitive strings with ***REMOVED***
java -jar bfg.jar --replace-text passwords.txt my-repo.git

# Clean up
cd my-repo.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push --force
git filter-branch Method
While slower than BFG, git filter-branch is built into Git and offers more flexibility.
Removing a File from All History
# Remove a file from all commits
git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch path/to/sensitive-file' \
  --prune-empty --tag-name-filter cat -- --all

# Remove the original refs backed up by filter-branch
git for-each-ref --format="delete %(refname)" refs/original | git update-ref --stdin

# Expire all reflog entries
git reflog expire --expire=now --all

# Garbage collect
git gc --prune=now --aggressive

# Force push to remote
git push origin --force --all
git push origin --force --tags
Removing a Directory
# Remove an entire directory from history
git filter-branch --force --index-filter \
  'git rm -r --cached --ignore-unmatch path/to/directory' \
  --prune-empty --tag-name-filter cat -- --all

# Clean up
git for-each-ref --format="delete %(refname)" refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --prune=now --aggressive
Changing Author Information
#!/bin/sh
# Script to change author information

git filter-branch --env-filter '
OLD_EMAIL="wrong@example.com"
CORRECT_NAME="Correct Name"
CORRECT_EMAIL="correct@example.com"

if [ "$GIT_COMMITTER_EMAIL" = "$OLD_EMAIL" ]
then
    export GIT_COMMITTER_NAME="$CORRECT_NAME"
    export GIT_COMMITTER_EMAIL="$CORRECT_EMAIL"
fi
if [ "$GIT_AUTHOR_EMAIL" = "$OLD_EMAIL" ]
then
    export GIT_AUTHOR_NAME="$CORRECT_NAME"
    export GIT_AUTHOR_EMAIL="$CORRECT_EMAIL"
fi
' --tag-name-filter cat -- --branches --tags
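To check the result of an author rewrite, one option is to list every distinct author and committer email that remains in history. This is a small sketch; `emails_in_history` is a hypothetical helper name, and it looks at all refs, which is broader than the `--branches --tags` scope used above.

```shell
#!/bin/sh
# emails_in_history (hypothetical helper)
# Print each distinct author or committer email found on any ref, sorted.
emails_in_history() {
    git log --all --format='%ae%n%ce' | sort -u
}
```

If `emails_in_history | grep -Fx 'wrong@example.com'` still prints a line after the rewrite, some ref (often a backup under `refs/original`) still points at the old commits.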
git filter-repo Method (Recommended)
git filter-repo is the tool that the Git project's own filter-branch documentation recommends for rewriting repository history.
Installation
# Using pip
pip install git-filter-repo

# Or download directly
wget https://raw.githubusercontent.com/newren/git-filter-repo/main/git-filter-repo
chmod +x git-filter-repo
sudo mv git-filter-repo /usr/local/bin/
Basic Usage
# Remove a file
git filter-repo --path sensitive-file.txt --invert-paths

# Remove a directory
git filter-repo --path secret-directory/ --invert-paths

# Remove multiple paths
git filter-repo --path file1.txt --path dir1/ --path dir2/ --invert-paths

# Keep only specific paths
git filter-repo --path src/ --path docs/
Advanced Filtering
# Remove files by pattern
git filter-repo --filename-callback '
  if filename.endswith(b".log"):
    return None
  return filename'

# Remove large files
git filter-repo --strip-blobs-bigger-than 10M

# Replace text in all files
echo 'PASSWORD==>***REMOVED***' > expressions.txt
echo 'regex:password\s*=\s*["'\'']*[^"'\'']+["'\'']*==>password = "***REMOVED***"' >> expressions.txt
git filter-repo --replace-text expressions.txt
Finding Large Files in Git History
Script to Find Large Objects
#!/bin/bash
# find-large-files.sh - Find the largest objects in Git history

# Set the number of largest files to show
NUM_FILES=20

# Find large objects
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  tail -n "$NUM_FILES" |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
Using git-sizer
# Install git-sizer
wget https://github.com/github/git-sizer/releases/download/v1.5.0/git-sizer-1.5.0-linux-amd64.zip
unzip git-sizer-1.5.0-linux-amd64.zip
sudo mv git-sizer /usr/local/bin/

# Analyze repository
git-sizer --verbose

# Output in JSON format
git-sizer --json > git-sizer-report.json
Cleaning Git History
Squashing Commits
# Interactive rebase to squash last N commits
git rebase -i HEAD~N

# In the editor, change 'pick' to 'squash' or 's' for commits to squash
# Example:
# pick abc1234 First commit
# squash def5678 Second commit
# squash ghi9012 Third commit

# To squash all commits into one
git rebase -i --root
Cleaning Up Merge Commits
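# Rebase while keeping the branch structure: --rebase-merges recreates
# merge commits instead of dropping them
git rebase -i --rebase-merges HEAD~20

# Or to completely linearize history (a plain rebase drops merge commits)
git rebase -i HEAD~20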
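A quick way to see which of the two rebases you actually got is to count merge commits before and after. This is a minimal sketch; `count_merges` is a hypothetical helper name.

```shell
#!/bin/sh
# count_merges [REV] (hypothetical helper)
# Print the number of merge commits reachable from REV (default: HEAD).
count_merges() {
    git rev-list --merges --count "${1:-HEAD}"
}
```

After a linearizing rebase of the whole range, the count for the rebased commits should drop to zero; with `--rebase-merges` it should stay the same unless merges were edited out by hand.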
Removing Empty Commits
# Remove commits that don't change anything
git filter-branch --prune-empty --tag-name-filter cat -- --all

# Using git filter-repo (better)
git filter-repo --prune-empty always
Repository Maintenance
Garbage Collection
# Run garbage collection
git gc

# Aggressive garbage collection (takes longer but more thorough)
git gc --aggressive --prune=now

# Verify repository integrity
git fsck --full

# Delete untracked files and directories (preview first with: git clean -nd),
# prune stale remote-tracking refs, and repack the object store
git clean -fd
git remote prune origin
git repack -a -d --depth=250 --window=250
Pruning Objects
# Prune all unreachable objects
git prune --expire=now

# Prune reflog entries older than 30 days
git reflog expire --expire=30.days --all

# Prune remote-tracking branches
git remote prune origin

# Delete every local branch whose name does not contain "master".
# This does not check for staleness: -D force-deletes unmerged branches too.
git for-each-ref --format '%(refname:short)' refs/heads | grep -v master | xargs git branch -D
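A more conservative way to pick branches for deletion is to list only those already merged into a base branch. The sketch below assumes that "stale" means "fully merged"; `merged_branches` is a hypothetical helper name, and the base branch is whatever your project uses.

```shell
#!/bin/sh
# merged_branches BASE (hypothetical helper)
# Print local branches whose tip is reachable from BASE, excluding BASE itself.
merged_branches() {
    base=$1
    git for-each-ref --merged "$base" --format='%(refname:short)' refs/heads |
        grep -vx -- "$base"
}
```

The output can be reviewed and then deleted with the non-forcing `-d`, for example `merged_branches master | while read -r b; do git branch -d "$b"; done`, so Git still refuses to drop anything that is not merged into the current branch.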
Repository Statistics
#!/bin/bash
# repo-stats.sh - Display repository statistics

echo "Repository Size:"
du -sh .git

echo -e "\nObject Count:"
git count-objects -v

echo -e "\nLargest Files in Working Directory:"
find . -type f -not -path './.git/*' -exec du -h {} + | sort -rh | head -20

echo -e "\nBranch Count:"
git branch -a | wc -l

echo -e "\nCommit Count:"
git rev-list --all --count

echo -e "\nContributors:"
git shortlog -sn
Removing Sensitive Data Patterns
Common Patterns to Remove
# Create patterns file. The rules cover, in order: AWS credentials, API keys
# and tokens, passwords, private keys, and database URLs. The file holds only
# rules: the expressions file has no documented comment syntax, so a "# ..."
# line would be read as literal text to replace.
cat > sensitive-patterns.txt << 'EOF'
regex:AKIA[0-9A-Z]{16}==>[AWS_ACCESS_KEY_REMOVED]
regex:aws_secret_access_key\s*=\s*["']?[A-Za-z0-9/+=]{40}["']?==>aws_secret_access_key="[REMOVED]"
regex:api[_-]?key\s*[:=]\s*["']?[A-Za-z0-9]{32,}["']?==>api_key="[REMOVED]"
regex:token\s*[:=]\s*["']?[A-Za-z0-9]{32,}["']?==>token="[REMOVED]"
regex:password\s*[:=]\s*["']?[^"'\s]+["']?==>password="[REMOVED]"
regex:passwd\s*[:=]\s*["']?[^"'\s]+["']?==>passwd="[REMOVED]"
regex:-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----==>-----BEGIN PRIVATE KEY-----[REMOVED]-----END PRIVATE KEY-----
regex:postgres://[^@]+@[^/]+/\w+==>postgres://[REMOVED]@[REMOVED]/[REMOVED]
regex:mysql://[^@]+@[^/]+/\w+==>mysql://[REMOVED]@[REMOVED]/[REMOVED]
regex:mongodb://[^@]+@[^/]+/\w+==>mongodb://[REMOVED]@[REMOVED]/[REMOVED]
EOF

# Apply patterns with git filter-repo
git filter-repo --replace-text sensitive-patterns.txt
Verification Script
#!/bin/bash
# verify-sensitive-data.sh - Check for sensitive data in repository

echo "Checking for potential sensitive data..."

# Check for private keys
echo -e "\nSearching for private keys:"
git grep -E "BEGIN (RSA|DSA|EC|OPENSSH) PRIVATE KEY" || echo "No private keys found"

# Check for AWS credentials
echo -e "\nSearching for AWS credentials:"
git grep -E "AKIA[0-9A-Z]{16}" || echo "No AWS access keys found"

# Check for passwords
echo -e "\nSearching for passwords:"
git grep -iE "password\s*[:=]" | grep -v -E "(example|test|dummy|placeholder)" || echo "No passwords found"

# Check for API keys
echo -e "\nSearching for API keys:"
git grep -iE "(api[_-]?key|token)\s*[:=]" | grep -v -E "(example|test|dummy|placeholder)" || echo "No API keys found"

# Check for large files
echo -e "\nLarge files in current working tree:"
find . -type f -size +10M -not -path './.git/*' -exec ls -lh {} \;
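The `git grep` checks in this script look only at tracked files in the current working tree, so a secret that was committed and later deleted would not be reported. One way to cover history as well is Git's `-G` pickaxe option, sketched below; `commits_touching_pattern` is a hypothetical helper name.

```shell
#!/bin/sh
# commits_touching_pattern REGEX (hypothetical helper)
# List commits on any ref whose diff adds or removes a line matching REGEX.
commits_touching_pattern() {
    git log --all --format='%h %s' -G "$1"
}
```

For example, `commits_touching_pattern 'AKIA[0-9A-Z]{16}'` should print nothing once the history has been cleaned. On large repositories this walks every diff and can be slow.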
Best Practices
Pre-Cleanup Checklist
#!/bin/bash
echo "Git Repository Cleanup Pre-Check"
echo "================================"

# Check if repository is clean
if [[ -n $(git status -s) ]]; then
    echo "❌ Working directory is not clean. Commit or stash changes first."
    exit 1
else
    echo "✅ Working directory is clean"
fi

# Check for backup
echo -e "\n⚠️ Have you backed up your repository? (y/n)"
read -r response
if [[ ! "$response" =~ ^([yY][eE][sS]|[yY])$ ]]; then
    echo "Please back up your repository first!"
    echo "Run: git clone --mirror $(git remote get-url origin) backup-$(date +%Y%m%d)"
    exit 1
fi

# List remotes
echo -e "\n📡 Remote repositories:"
git remote -v

# Show repository size
echo -e "\n📊 Repository size:"
du -sh .git

# Count objects
echo -e "\n🔢 Object count:"
git count-objects -v

echo -e "\n✅ Pre-cleanup check complete. Safe to proceed."
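The backup that the checklist asks about can itself be scripted and checked. The following is a minimal sketch under the assumption that a mirror clone plus an integrity check is an adequate backup for your purposes; `mirror_backup` is a hypothetical helper name.

```shell
#!/bin/sh
# mirror_backup SRC DEST (hypothetical helper)
# Create a bare mirror clone of SRC at DEST, then verify its object store.
# Returns non-zero if either the clone or the integrity check fails.
mirror_backup() {
    src=$1
    dest=$2
    git clone --quiet --mirror "$src" "$dest" &&
        git -C "$dest" fsck --full
}
```

A typical call would be `mirror_backup "$(git remote get-url origin)" "backup-$(date +%Y%m%d)"`, matching the command the checklist prints.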
.gitignore Templates
# Common patterns to add to .gitignore before cleanup

# Sensitive files
*.pem
*.key
*.pfx
*.p12
.env
.env.*
config/secrets.yml
config/database.yml

# Large files
*.log
*.sql
*.dump
*.tar
*.tar.gz
*.zip
*.7z
*.rar

# OS generated files
.DS_Store
Thumbs.db
*.swp
*~

# IDE files
.idea/
.vscode/
*.iml
.project
.settings/

# Build artifacts
build/
dist/
*.egg-info/
__pycache__/
node_modules/
target/
*.class
*.jar
Post-Cleanup Steps
#!/bin/bash
# post-cleanup.sh - Steps to perform after cleanup

echo "Post-Cleanup Tasks"
echo "=================="

# Update all branches
echo -e "\n1. Updating all branches..."
git fetch origin --prune
for branch in $(git branch -r | grep -v HEAD | sed 's/origin\///')
do
    echo "Updating branch: $branch"
    # Reset the local branch to the rewritten remote branch; a plain
    # 'git pull' could merge the old history back in
    git checkout -B "$branch" "origin/$branch"
done

# Notify team members
echo -e "\n2. Team notification template:"
cat << EOF
Subject: Important: Git Repository Cleanup Completed

Team,

We have completed a cleanup of our Git repository. This cleanup included:
- Removing large files from history
- Removing sensitive data
- Optimizing repository size

ACTION REQUIRED:
1. Delete your local repository
2. Re-clone the repository: git clone [repository-url]
3. If you have local branches, you'll need to recreate them
4. Update any CI/CD pipelines that might be affected

The repository size has been reduced from X GB to Y GB.

Please complete these steps before continuing work.

Thank you for your cooperation.
EOF

# Verify cleanup
echo -e "\n3. Verification:"
echo "New repository size: $(du -sh .git | cut -f1)"
echo "Object count: $(git count-objects -v | grep 'count:' | cut -d' ' -f2)"

# Update documentation
echo -e "\n4. Update documentation:"
echo "- Update README with new clone instructions"
echo "- Document any changed procedures"
echo "- Update CI/CD configurations if needed"
Security Considerations
Rotating Compromised Credentials
After removing sensitive data from Git history:
- Immediately rotate all exposed credentials
- Audit access logs for any unauthorized usage
- Update all systems using the compromised credentials
- Implement secret scanning in CI/CD pipeline
Preventing Future Issues
# Install pre-commit, plus detect-secrets for the scan command used below
pip install pre-commit detect-secrets

# Create .pre-commit-config.yaml
cat > .pre-commit-config.yaml << 'EOF'
repos:
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
        args: ['--baseline', '.secrets.baseline']
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: check-added-large-files
        args: ['--maxkb=1000']
      - id: check-merge-conflict
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
EOF

# Install hooks
pre-commit install

# Create baseline for existing secrets (to ignore)
detect-secrets scan > .secrets.baseline
Conclusion
Regular repository maintenance is crucial for keeping Git repositories fast, secure, and manageable. Key takeaways:
- Always back up before performing destructive operations
- Use BFG or git-filter-repo instead of filter-branch when possible
- Rotate credentials immediately after removing them from history
- Implement preventive measures like pre-commit hooks
- Communicate with your team before and after cleanup
- Document the process for future reference
Remember that rewriting history affects all collaborators, so coordinate carefully and ensure everyone is prepared for the changes.