
Demystifying Git's Object Database: A Hands-On Exploration of Blobs, Trees, and Commits


You’ve rebased, cherry-picked, and force-pushed your way through complex merges—but when git fsck reports a dangling blob or git gc refuses to clean up space, you’re suddenly debugging a system you don’t fully understand. Let’s fix that by exploring Git’s object model from the inside out.

Most developers treat Git as a black box. They memorize commands, follow workflows, and occasionally panic when something goes wrong. But Git isn’t magic—it’s a beautifully designed content-addressable filesystem with a version control interface layered on top. Once you understand how Git actually stores and retrieves data, commands that seemed mysterious become obvious, and recovery from mistakes becomes straightforward rather than terrifying.

In this guide, we’ll peel back the layers of abstraction and examine Git’s internals directly. We’ll read raw objects, write our own using Python, and develop the mental model that separates Git experts from Git users. By the end, you’ll view your .git directory not as a mysterious folder to avoid, but as a well-organized database you can inspect, manipulate, and repair.

The Content-Addressable Filesystem Under .git

When you initialize a Git repository, you’re creating something far more sophisticated than a simple version control system. You’re instantiating a content-addressable filesystem—a database where every piece of data is indexed by a cryptographic hash of its contents.

The .git/objects directory serves as this database’s storage layer. Every file you’ve ever committed, every directory structure, every commit message—all of it lives here as discrete objects, each identified by a 40-character SHA-1 hash. This hash isn’t arbitrary; it’s computed from the object’s content, which means identical content always produces identical hashes.

This design choice represents a fundamental departure from how traditional filesystems work. In a normal filesystem, files are identified by their path—/home/user/project/file.txt. The path is independent of the content; you can change the file’s contents without changing its identity. Git inverts this relationship entirely: the content determines the identity. Change a single byte, and you have a completely different object with a completely different hash.

This design decision has profound implications. When you add the same file to two different branches, Git stores it exactly once. When you copy a 10MB file and commit both copies, your repository grows by approximately 10MB, not 20MB. The content-addressable nature provides automatic deduplication at no additional cost. This happens transparently—Git doesn’t scan for duplicates or maintain a deduplication index. The mathematics of hashing guarantee that identical content maps to identical locations.

Consider what this means for a large monorepo with thousands of developers. When multiple teams add the same library file, the same license text, or the same configuration template, Git stores each unique piece of content exactly once. The storage savings compound dramatically at scale.

The directory structure inside .git/objects uses the first two characters of the hash as a subdirectory name, with the remaining 38 characters forming the filename. This fan-out structure prevents any single directory from containing too many files—a practical consideration for filesystem performance. A blob with hash a1b2c3d4e5f6... lives at .git/objects/a1/b2c3d4e5f6.... Without this fan-out, a repository with millions of objects would have a single directory with millions of entries, causing severe performance degradation on most filesystems.

explore-objects-directory.sh
# Initialize a new repository and examine the objects directory
git init example-repo && cd example-repo
# Initially empty (almost)
ls -la .git/objects/
# Output: info/ pack/
# Create and add a file
echo "Hello, World!" > hello.txt
git add hello.txt
# Now we have a loose object
find .git/objects -type f
# Output: .git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d
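
That hash can be reproduced without Git at all. As a quick sanity check (anticipating the object format we'll cover in detail later), a few lines of Python hash the header-plus-content payload exactly as Git does:

reproduce-blob-hash.py
```python
import hashlib

# Git hashes "blob <size>\0<content>", not the raw file bytes.
# echo appends a newline, so "Hello, World!" stores as 14 bytes.
content = b"Hello, World!\n"
store = b"blob " + str(len(content)).encode() + b"\x00" + content
sha1 = hashlib.sha1(store).hexdigest()
print(sha1)  # 8ab686eafeb1f44702738c8b0f24f2567c36da6d
print(f".git/objects/{sha1[:2]}/{sha1[2:]}")  # the fan-out storage path
```

The result matches the loose object path that find printed above.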

Four types of objects populate this database: blobs store file contents, trees represent directory structures, commits tie everything together with metadata and history, and tags provide named references to specific commits. Every operation you perform—branching, merging, rebasing—manipulates these four object types through well-defined transformations. There are no special cases, no hidden state, no magic. Everything reduces to operations on these four object types.

Understanding this storage model transforms how you think about Git operations. A branch isn’t a copy of your code; it’s a pointer to a commit object. A merge doesn’t duplicate files; it creates a new commit pointing to a new tree that references existing blobs. The efficiency and elegance of Git’s data model make its powerful features possible. When you realize that creating a branch is just writing a 41-byte file (a 40-character SHA-1 plus a newline), you understand why Git branches are so lightweight compared to other version control systems.

This content-addressable design also provides data integrity guarantees. If a single bit flips in any object—whether from disk corruption, memory errors, or network transmission problems—its hash no longer matches, and Git detects the corruption. The hash serves simultaneously as an identifier, a deduplication key, and a checksum. This triple duty makes the design remarkably efficient while providing guarantees that filesystem-based version control cannot offer.

Anatomy of Git Objects: Blobs, Trees, and Commits

Let’s examine each object type by inspecting real objects in a repository. The git cat-file command serves as your microscope into the object database. This command is your primary diagnostic tool—learn it well, and the internals of any Git operation become transparent.

Blobs store raw file content with no metadata—no filename, no permissions, nothing but bytes. When you stage a file, Git compresses its content and stores it as a blob. This separation of content from metadata might seem strange at first, but it enables powerful features that would otherwise be impossible.

inspect-blob.sh
# Create a simple file and add it to Git
echo "Hello, Git internals!" > greeting.txt
git add greeting.txt
# Find the blob hash
git ls-files --stage
# Output: 100644 a5c19b6a2e... 0 greeting.txt
# Inspect the blob content
git cat-file -p a5c19b6a2e
# Output: Hello, Git internals!
# Check the object type
git cat-file -t a5c19b6a2e
# Output: blob
# Check the object size in bytes
git cat-file -s a5c19b6a2e
# Output: 22

Notice that the blob contains only “Hello, Git internals!”—the filename greeting.txt exists nowhere in this object. This separation enables Git’s efficient handling of renamed files: rename a file, and Git stores no new content, just a new tree pointing to the same blob. Git’s rename detection works by comparing blob hashes—if two files in different commits share the same hash, Git knows they have identical content regardless of their names.

This design also explains why Git doesn’t truly track empty directories. A tree can only contain entries that reference objects, and there’s no object type for “empty directory.” Workarounds like adding .gitkeep files exist precisely because the content-addressable model requires content to exist.

Trees map names to blobs (and other trees), recreating your directory structure. Each tree entry contains a mode, object type, SHA-1 hash, and name. Trees are the glue that gives your content structure—without them, you’d have a pile of anonymous blobs with no way to reconstruct your project.

inspect-tree.sh
# After committing, inspect the tree
git cat-file -p HEAD^{tree}
# Output:
# 100644 blob a5c19b6a2e... greeting.txt
# 040000 tree b2d3f4a5c6... src
# Dive into a subdirectory tree
git cat-file -p b2d3f4a5c6
# Output:
# 100644 blob c3e4f5a6b7... main.py
# 100644 blob d4f5e6a7b8... utils.py
# Trees can nest arbitrarily deep
git cat-file -p HEAD^{tree}:src/lib
# Shows contents of src/lib/ directory

The mode field encodes file type and permissions: 100644 indicates a regular file, 100755 an executable, 040000 a subdirectory, and 120000 a symbolic link. These modes come directly from Unix filesystem conventions, though Git only tracks a subset of possible permissions. Notably, Git doesn’t preserve the full Unix permission bits—just whether a file is executable or not.

Understanding tree structure helps explain Git’s behavior with large directory renames. When you rename a directory, Git must create a new tree object (with the new name pointing to the same subtree), and new tree objects for every parent directory up to the root. The subtree itself—and all blobs within it—remains unchanged. This is why large directory renames are essentially instant despite affecting many “files.”

Commits bind trees to history. Each commit contains a tree reference, zero or more parent references, author information, committer information, and a message. The commit object is where Git’s version control nature emerges from its content-addressable foundation.

inspect-commit.sh
# Examine a commit object
git cat-file -p HEAD
# Output:
# tree 2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d
# parent 8a7b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b
# author Tim Derzhavets <[email protected]> 1707500000 +0000
# committer Tim Derzhavets <[email protected]> 1707500000 +0000
#
# Add greeting functionality
# Examine a merge commit (has multiple parents)
git cat-file -p <merge-commit-sha>
# Output:
# tree ...
# parent <first-parent-sha>
# parent <second-parent-sha>
# author ...

The distinction between author and committer matters in workflows involving patches. When you apply someone else’s patch, they’re the author (they wrote the code), but you’re the committer (you added it to the repository). Commands like git format-patch and git am preserve this distinction. The timestamps also differ independently—author timestamp reflects when code was originally written, committer timestamp reflects when it entered this repository.

The parent line creates Git’s history graph. Initial commits have no parent, regular commits have one, and merge commits have two or more. This linked structure enables Git to traverse history efficiently—but it also means that changing any commit’s content changes its hash, which changes any child commit’s parent reference, which changes that commit’s hash, rippling forward through all history. This is why rebasing “rewrites history”—it literally creates new commit objects with different hashes.
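The ripple effect is easy to demonstrate with a sketch. The helper below hand-builds a minimal commit payload (illustrative only: real commits also carry author and committer lines, and the tree id here is made up) and hashes it the way Git does. Rewording an ancestor changes its hash, and therefore the hash of every descendant that names it as a parent:

commit-ripple.py
```python
import hashlib

def commit_sha(tree, parent, message):
    """Hash a minimal commit the way Git does. Illustrative only:
    real commits also carry author and committer lines."""
    body = f"tree {tree}\n"
    if parent:
        body += f"parent {parent}\n"
    body += f"\n{message}\n"
    payload = body.encode()
    header = f"commit {len(payload)}\x00".encode()
    return hashlib.sha1(header + payload).hexdigest()

tree = "2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d"  # made-up tree id for the demo
a1 = commit_sha(tree, None, "Initial commit")
b1 = commit_sha(tree, a1, "Add feature")

# Reword the root commit: its hash changes...
a2 = commit_sha(tree, None, "Initial commit, reworded")
# ...so the child, which embeds the parent hash, changes too.
b2 = commit_sha(tree, a2, "Add feature")
print(a1 != a2, b1 != b2)  # True True
```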

Tags come in two forms: lightweight tags are simple refs (covered in the next section), while annotated tags are full objects containing a target reference, tagger information, and a message. Annotated tags support GPG signatures for release verification.

inspect-tag.sh
# Create an annotated tag
git tag -a v1.0.0 -m "Release version 1.0.0"
# Find and inspect the tag object
git cat-file -p v1.0.0
# Output:
# object 8a7b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b
# type commit
# tag v1.0.0
# tagger Tim Derzhavets <[email protected]> 1707500000 +0000
#
# Release version 1.0.0

Pro Tip: Use git cat-file -p liberally when debugging. It works on any object type and formats the output readably. Combine it with git rev-parse to resolve any reference to its SHA-1: git cat-file -p $(git rev-parse HEAD~3).

Building Git Objects from Scratch with Python

Theory solidifies into understanding when you implement it yourself. Let’s write Python code that creates Git objects exactly as Git does. This exercise isn’t just academic—understanding the object format enables custom tooling, migration scripts, and deep debugging capabilities.

A Git object follows a specific format: a header containing the object type, a space, the content length in bytes, a null byte, and then the raw content. This entire payload gets zlib-compressed before storage. The format is simple enough to implement in any language, which partly explains Git’s success—clients and servers can be implemented independently while remaining interoperable.

git_objects.py
import hashlib
import zlib
import os
from typing import Tuple


def create_blob(content: bytes, git_dir: str = ".git") -> str:
    """
    Create a Git blob object from content bytes.
    Returns the SHA-1 hash of the created object.

    The blob format is: "blob <size>\0<content>"
    This gets zlib-compressed before storage.
    """
    # Construct the blob header
    # The header format: "type size\0" where \0 is a null byte
    header = f"blob {len(content)}\x00".encode("ascii")
    # Combine header and content to form the complete object
    store = header + content
    # Compute SHA-1 hash of the full object (header + content)
    # This is the object's identity in Git's database
    sha1 = hashlib.sha1(store).hexdigest()
    # Compress using zlib (Git uses default compression level)
    compressed = zlib.compress(store)
    # Determine storage path using fan-out structure
    # First 2 chars become directory, remaining 38 chars become filename
    object_dir = os.path.join(git_dir, "objects", sha1[:2])
    object_path = os.path.join(object_dir, sha1[2:])
    # Create directory if needed
    os.makedirs(object_dir, exist_ok=True)
    # Write the compressed object atomically
    # Only write if it doesn't exist (content-addressable guarantee)
    if not os.path.exists(object_path):
        # Write to temp file first, then rename for atomicity
        temp_path = object_path + ".tmp"
        with open(temp_path, "wb") as f:
            f.write(compressed)
        os.rename(temp_path, object_path)
    return sha1


def read_object(sha1: str, git_dir: str = ".git") -> Tuple[str, bytes]:
    """
    Read and decompress a Git object.
    Returns tuple of (object_type, content).
    """
    object_path = os.path.join(git_dir, "objects", sha1[:2], sha1[2:])
    with open(object_path, "rb") as f:
        compressed = f.read()
    # Decompress the zlib-compressed data
    raw = zlib.decompress(compressed)
    # Find the null byte separating header from content
    null_index = raw.index(b"\x00")
    header = raw[:null_index].decode("ascii")
    content = raw[null_index + 1:]
    # Parse header: "type size"
    obj_type, size_str = header.split(" ")
    size = int(size_str)
    # Validate size matches actual content length
    if len(content) != size:
        raise ValueError(f"Size mismatch: header says {size}, got {len(content)}")
    return obj_type, content


def create_tree(entries: list, git_dir: str = ".git") -> str:
    """
    Create a Git tree object from a list of entries.
    Each entry is a tuple: (mode, name, sha1)

    Tree format: concatenated entries of "mode name\0<20-byte-sha>"
    """
    # Sort entries by name (Git requires sorted entries; note that Git
    # sorts subdirectory names as if they carried a trailing "/")
    sorted_entries = sorted(entries, key=lambda e: e[1])
    # Build tree content
    content = b""
    for mode, name, sha1 in sorted_entries:
        # Mode and name as ASCII, separated by space
        entry = f"{mode} {name}\x00".encode("ascii")
        # SHA-1 as raw 20 bytes (not hex string)
        entry += bytes.fromhex(sha1)
        content += entry
    # Create object with tree header
    header = f"tree {len(content)}\x00".encode("ascii")
    store = header + content
    sha1 = hashlib.sha1(store).hexdigest()
    compressed = zlib.compress(store)
    object_dir = os.path.join(git_dir, "objects", sha1[:2])
    object_path = os.path.join(object_dir, sha1[2:])
    os.makedirs(object_dir, exist_ok=True)
    if not os.path.exists(object_path):
        with open(object_path, "wb") as f:
            f.write(compressed)
    return sha1


# Usage example
if __name__ == "__main__":
    # Create a blob
    test_content = b"Hello, Git internals!\n"
    blob_sha = create_blob(test_content)
    print(f"Created blob: {blob_sha}")
    # Verify with Git command
    os.system(f"git cat-file -p {blob_sha}")
    # Read it back using our function
    obj_type, content = read_object(blob_sha)
    print(f"Read back: type={obj_type}, content={content}")
    # Create a tree containing the blob
    tree_sha = create_tree([
        ("100644", "hello.txt", blob_sha)
    ])
    print(f"Created tree: {tree_sha}")
    os.system(f"git cat-file -p {tree_sha}")

The hash computation deserves special attention. Git hashes the header-plus-content combination, not just the content. This means you’ll get a different hash if you hash only the file content with an external tool like sha1sum. The header ensures that different object types with identical content produce different hashes—a blob and a commit can never accidentally collide even if they contained the same bytes.

verify-hash-computation.sh
# Compare our hash computation with Git's
echo -n "Hello, Git internals!" > test.txt
# Wrong way: hashing just the content
sha1sum test.txt
# Output: 7f8d7a1... (different from Git's hash)
# Git's way: hash includes header
# "blob 21\0Hello, Git internals!"
printf "blob 21\0Hello, Git internals!" | sha1sum
# Output: matches what git hash-object produces
# Verify with Git
git hash-object test.txt
# Output: matches the printf version above

Run the Python script in any Git repository, then verify your objects with git cat-file -p <sha>. You’ll see your content exactly as stored. Git doesn’t know or care that Python created this object rather than git add—the content-addressable design means any tool producing valid objects is a valid Git client.

Note: The “only write if not exists” pattern mirrors Git’s behavior. Since content addressing guarantees that identical content produces identical hashes, rewriting an existing object would be wasteful and potentially dangerous (another process might be reading it).

Understanding this format enables powerful applications: migrating version control history from legacy systems, creating synthetic repositories for testing, implementing Git servers, or building specialized tools that manipulate history in ways porcelain commands don’t support.

Understanding Refs, HEAD, and the Commit Graph

With objects stored immutably in the database, Git needs a way to track which commits matter. That’s where refs come in—and they’re disarmingly simple. Refs are the mutable layer on top of Git’s immutable object database, providing the human-friendly names we use daily.

A ref is a file containing a SHA-1 hash. That’s it. Your branch main exists as .git/refs/heads/main, containing 40 hexadecimal characters plus a newline. The entire concept of “branches” reduces to this: text files with hashes.

explore-refs.sh
# View the main branch ref
cat .git/refs/heads/main
# Output: 8a7b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b
# List all refs
find .git/refs -type f -exec echo "{}:" \; -exec cat {} \;
# Refs can also be "packed" for efficiency
cat .git/packed-refs
# Output: lists refs that have been consolidated
# HEAD is special - it's usually a symbolic ref
cat .git/HEAD
# Output: ref: refs/heads/main
# Remote tracking branches live in refs/remotes
cat .git/refs/remotes/origin/main

HEAD typically contains a symbolic reference—a pointer to another ref rather than a direct SHA-1. When you’re on branch main, HEAD contains ref: refs/heads/main. Git follows this indirection to find the current commit. This indirection is what makes “being on a branch” meaningful—when you commit, Git updates the ref that HEAD points to.

Detached HEAD occurs when HEAD contains a SHA-1 directly instead of a symbolic reference. This happens when you checkout a specific commit, a tag, or a remote branch. You’re no longer “on” any branch; new commits won’t update any branch ref. Understanding this distinction clarifies otherwise confusing Git messages and behaviors.
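
Resolving HEAD by hand makes the indirection concrete. This sketch follows the symbolic ref one hop, or returns the SHA directly in the detached case; it builds a throwaway directory layout rather than touching a real repository, and for brevity it ignores packed refs:

resolve-head.py
```python
import os
import tempfile

def resolve_head(git_dir):
    """Return (ref_name_or_None, sha). Ignores packed-refs for brevity."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if head.startswith("ref: "):            # attached: HEAD names a branch
        ref = head[len("ref: "):]           # e.g. "refs/heads/main"
        with open(os.path.join(git_dir, *ref.split("/"))) as f:
            return ref, f.read().strip()
    return None, head                       # detached: HEAD holds a raw SHA

# Demo against a fake layout (not a real repository)
git_dir = os.path.join(tempfile.mkdtemp(), ".git")
os.makedirs(os.path.join(git_dir, "refs", "heads"))
sha = "8a7b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b"
with open(os.path.join(git_dir, "HEAD"), "w") as f:
    f.write("ref: refs/heads/main\n")
with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as f:
    f.write(sha + "\n")
print(resolve_head(git_dir))  # ('refs/heads/main', '8a7b...9a0b')
```

Writing a bare SHA into HEAD instead of the "ref: " line puts the same function into the detached case, mirroring what git checkout <sha> does.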

detached-head.sh
# Enter detached HEAD state
git checkout HEAD~3
# Warning: You are in 'detached HEAD' state...
cat .git/HEAD
# Output: 8a7b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b (direct SHA, no "ref:")
# Commits made here don't update any branch
git commit --allow-empty -m "Orphan commit"
# This commit exists but no branch points to it
# Return to a branch
git checkout main
cat .git/HEAD
# Output: ref: refs/heads/main
# The orphan commit is now "dangling" - reachable only via reflog

The commit graph forms through parent pointers. Starting from any commit, you can traverse its parents recursively to reconstruct history. Branches are entry points into this graph—named positions from which traversal begins. This graph structure is a directed acyclic graph (DAG), and understanding it explains Git’s behavior for operations like git log, git merge, and git rebase.

walk-history.sh
# Walk the commit graph manually
current=$(git rev-parse HEAD)
while [ -n "$current" ]; do
    echo "=== Commit: $current ==="
    git cat-file -p "$current" | head -5
    echo ""
    # Get parent (just first parent for simplicity)
    current=$(git cat-file -p "$current" 2>/dev/null | grep "^parent" | head -1 | cut -d' ' -f2)
done
# More practical: visualize the graph
git log --oneline --graph --all

Understanding this structure clarifies what branch operations actually do. git branch feature creates a new file at .git/refs/heads/feature containing the current commit’s SHA-1—literally 41 bytes written to disk. git checkout feature updates HEAD to contain ref: refs/heads/feature. git reset --hard HEAD~1 updates the current branch’s ref file to point to the parent commit. No copying, no moving—just pointer manipulation.

This is why Git branches are “cheap”—creating a thousand branches adds only about 40KB to your repository. Compare this to version control systems that copy entire directory trees for each branch.
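
That claim is directly checkable. This sketch creates a branch by writing its ref file against a throwaway directory; real code should go through git branch or git update-ref, which also handle locking and packed refs:

branch-is-a-file.py
```python
import os
import tempfile

def create_branch(name, sha, git_dir):
    """Create a branch by writing its ref file: 40 hex chars plus a newline."""
    path = os.path.join(git_dir, "refs", "heads", name)
    os.makedirs(os.path.dirname(path), exist_ok=True)  # handles names like "feature/x"
    with open(path, "w") as f:
        f.write(sha + "\n")
    return os.path.getsize(path)

demo_git = os.path.join(tempfile.mkdtemp(), ".git")
size = create_branch("feature", "8a7b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b", demo_git)
print(size)  # 41 bytes: the entire cost of a new branch
```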

The reflog adds another layer: Git maintains a log of every position each ref has occupied. This provides a safety net—even after seemingly destructive operations, the reflog remembers where refs pointed previously, enabling recovery.

reflog-safety.sh
# View reflog for HEAD (where HEAD has pointed)
git reflog
# Shows every commit HEAD has referenced
# View reflog for a specific branch
git reflog show main
# Shows every commit main has referenced
# Reflog entries expire (default 90 days for reachable entries, 30 for unreachable)
# This provides a generous recovery window

Plumbing Commands: Git’s Low-Level Toolkit

Git’s interface divides into porcelain (user-friendly commands like commit, merge, pull) and plumbing (low-level building blocks). Plumbing commands operate directly on the object database and refs, giving you surgical precision for automation and debugging. The names come from plumbing fixtures—porcelain is the pretty interface users interact with, while plumbing is the pipes hidden behind the walls that do the real work.

git hash-object computes an object’s hash and optionally writes it to the database:

plumbing-basics.sh
# Compute hash without storing
echo "test content" | git hash-object --stdin
# Output: d670460b4b4aece5915caf5c68d12f560a9fe3e4
# Compute and store as blob
echo "test content" | git hash-object -w --stdin
# Now the object exists in .git/objects/d6/70460...
# Hash a file (common for adding binary files manually)
git hash-object -w large-binary-file.bin
# Hash as a different object type (rarely needed)
git hash-object -t tree --stdin < tree-content

git cat-file reads objects, as we’ve seen. The -t flag returns the type, -s returns the size, and -p pretty-prints the content. It’s your primary inspection tool for the object database.

cat-file-variations.sh
# Pretty print (auto-formats based on type)
git cat-file -p HEAD
# Get object type
git cat-file -t HEAD
# Output: commit
# Get object size in bytes
git cat-file -s HEAD
# Output: 234
# Check if object exists (exit code 0 if yes)
git cat-file -e abc123def456 && echo "exists" || echo "not found"
# Batch mode for efficiency when checking many objects
echo -e "HEAD\nHEAD~1\nHEAD~2" | git cat-file --batch-check

git update-index manipulates the staging area (index) directly. The index is a binary file at .git/index that tracks what will go into the next commit.

update-index.sh
# Add a blob to the index with a specific mode and path
git update-index --add --cacheinfo 100644 \
    d670460b4b4aece5915caf5c68d12f560a9fe3e4 \
    newfile.txt
# View the index in detail
git ls-files --stage
# Remove a file from the index without touching the working tree
git update-index --remove filename.txt
# Mark a file as assume-unchanged (useful for local config files)
git update-index --assume-unchanged config.local.txt

git write-tree takes the current index and creates a tree object:

create-commit-manually.sh
# Create a tree from the current index
tree_sha=$(git write-tree)
echo "Tree: $tree_sha"
# Create a commit object pointing to this tree
# The commit-tree command reads the message from stdin
commit_sha=$(echo "Manual commit message" | \
    git commit-tree "$tree_sha" -p HEAD)
echo "Commit: $commit_sha"
# Update the branch ref to point to our new commit
git update-ref refs/heads/main "$commit_sha"
# Verify the result
git log --oneline -3

This sequence—hash-object, update-index, write-tree, commit-tree, update-ref—is exactly what git commit does internally. Understanding this decomposition lets you script Git operations that porcelain commands don’t directly support.

Here’s a complete example creating a commit from scratch, demonstrating the full workflow:

full-plumbing-workflow.sh
# Start with a new repo
git init plumbing-demo && cd plumbing-demo
# Create blobs for two files
blob1=$(echo "File one content" | git hash-object -w --stdin)
blob2=$(echo "File two content" | git hash-object -w --stdin)
echo "Blob 1: $blob1"
echo "Blob 2: $blob2"
# Add blobs to the index
git update-index --add --cacheinfo 100644 "$blob1" file1.txt
git update-index --add --cacheinfo 100644 "$blob2" file2.txt
# Create tree from index
tree=$(git write-tree)
echo "Tree: $tree"
# Create initial commit (no parent)
commit=$(echo "Initial commit via plumbing" | git commit-tree "$tree")
echo "Commit: $commit"
# Point main branch to our commit
git update-ref refs/heads/main "$commit"
# Point HEAD to main
git symbolic-ref HEAD refs/heads/main
# Verify everything works
git log --oneline
git ls-files

Warning: Plumbing commands bypass safety checks. git update-ref won’t warn you about losing commits. git update-index won’t validate that blobs exist. Use them deliberately, not casually.

Common automation scenarios for plumbing commands include: creating commits with specific timestamps for reproducible builds, manipulating history during migration from other VCS systems, implementing custom merge strategies, building Git-based data stores for non-code content, and creating synthetic test repositories.

Practical Applications: Debugging and Recovery

Your internals knowledge pays dividends when things go wrong. Let’s apply it to real recovery scenarios. The most stressful Git moments—lost commits, corrupted repositories, unexplained behavior—become manageable when you understand what’s actually happening at the object level.

git fsck (filesystem check) validates the object database and reports problems:

fsck-diagnosis.sh
# Check repository integrity
git fsck --full
# Common output you might see:
# Checking object directories: 100% (256/256), done.
# dangling blob 8a7b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b
# dangling commit 1234567890abcdef1234567890abcdef12345678
# More verbose output
git fsck --full --verbose
# Check connectivity only (faster)
git fsck --connectivity-only
# Find unreachable objects
git fsck --unreachable

Dangling objects result from normal Git operations: amended commits leave the originals dangling, deleted branches orphan their commits, and reset operations create dangling commits. These objects remain until garbage collection removes them. Dangling objects are not corruption—they’re orphaned content that’s no longer reachable from any ref but still exists in the database.

Understanding what creates dangling objects helps you not panic when you see them:

  • git commit --amend creates a new commit, leaving the original dangling
  • git rebase creates new commits for each rebased commit, leaving originals dangling
  • git reset --hard HEAD~1 moves the branch pointer, leaving the abandoned commit dangling
  • Deleting a branch orphans all commits unique to that branch

Recovering lost commits relies on the reflog—Git’s diary of ref changes:

recovery.sh
# View reflog for HEAD (shows where HEAD has pointed)
git reflog
# Output shows every position HEAD has occupied:
# 8a7b3c4 HEAD@{0}: reset: moving to HEAD~3
# 1234567 HEAD@{1}: commit: Add feature
# abcdef0 HEAD@{2}: commit: Fix bug
# Search reflog for specific patterns
git reflog | grep "feature"
# View reflog for a specific branch
git reflog show main
# Recover the lost commits by creating a new branch
git checkout -b recovery 1234567
# Or reset your current branch back to the lost commit
git reset --hard 1234567
# If reflog is empty/expired, find dangling commits
git fsck --lost-found
# Dangling commits are written to .git/lost-found/commit/
# Inspect dangling commits to find your lost work
for commit in $(git fsck --lost-found 2>/dev/null | grep "dangling commit" | cut -d' ' -f3); do
    echo "=== $commit ==="
    git log --oneline -1 "$commit"
done

The reflog maintains entries for 90 days by default, giving you a generous recovery window. Even after git reset --hard, your commits exist in the object database until garbage collection runs. This is why Git is forgiving—“permanent” data loss requires both removing all references AND running garbage collection AND waiting for the expiration period.

Understanding garbage collection helps you know what’s safe and what’s at risk:

gc-explained.sh
# See what would be cleaned up (dry run isn't directly available,
# but you can check unreachable objects first)
git fsck --unreachable
# Standard garbage collection
git gc
# Aggressive cleanup with more thorough packing
git gc --aggressive
# Prune immediately (removes protection period)
git gc --prune=now
# Prune unreachable objects older than 2 weeks (default)
git prune --expire=2.weeks.ago --dry-run
# Keep all dangling objects for now
git config gc.pruneExpire never

Git protects recent dangling objects from collection. By default, git gc only removes unreachable objects older than two weeks. The --prune=now flag overrides this protection—use it only when you’re certain you need no recovery.

Here’s a practical recovery scenario with step-by-step resolution:

recovery-scenario.sh
# Scenario: You accidentally ran git reset --hard and lost commits
# Step 1: Don't panic. Check the reflog.
git reflog -10
# Step 2: Identify the commit you want to recover
# Look for the commit message or SHA you remember
# Step 3: Verify the commit still exists
git cat-file -t <sha-from-reflog>
# Should output: commit
# Step 4: Recover by creating a branch or resetting
git branch recovered-work <sha-from-reflog>
# or
git reset --hard <sha-from-reflog>
# Step 5: Verify recovery
git log --oneline recovered-work

Pro Tip: Before any risky operation, run git reflog expire --expire=never --all to prevent reflog expiration, then proceed. Your safety net remains intact indefinitely. You can also use git stash or create a temporary branch as a safety bookmark.

Beyond the Basics: Packfiles and Performance

The loose object storage we’ve explored works well for small repositories, but imagine cloning a project with 100,000 commits. Individual zlib-compressed files add overhead: filesystem metadata, compression dictionary repetition, and no delta compression between similar objects.

Git solves this with packfiles. The git gc command (and git repack explicitly) consolidates loose objects into .pack files with accompanying .idx index files in .git/objects/pack/.

examine-packs.sh
# See current pack files
ls -la .git/objects/pack/
# Examine pack contents
git verify-pack -v .git/objects/pack/pack-*.idx | head -20
# See statistics about packing
git count-objects -v
# Output includes:
# count: 123 (loose objects)
# packs: 2 (number of pack files)
# size-pack: 4567 (pack size in KB)

Packfiles employ delta compression: rather than storing two similar blobs independently, Git stores one fully (the base) and the other as a delta—a set of instructions to transform the base into the target. For text files that change incrementally between commits, this achieves remarkable compression ratios. A file that changes by one line between 100 commits might be stored as one full copy plus 99 tiny deltas.
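
A toy delta makes the idea concrete. Git's real pack deltas are a compact binary stream of copy and insert opcodes; this simplified sketch uses the same two instruction kinds (as readable tuples rather than binary) to reconstruct a target from a base:

toy-delta.py
```python
def apply_delta(base, delta):
    """Rebuild a target from a base plus instructions. A simplified,
    readable stand-in for Git's binary copy/insert opcodes."""
    out = b""
    for op in delta:
        if op[0] == "copy":            # ("copy", offset, size): reuse base bytes
            _, offset, size = op
            out += base[offset:offset + size]
        else:                          # ("insert", data): literal new bytes
            out += op[1]
    return out

base = b"line one\nline two\nline three\n"
# The target changes only the middle line, so the delta stores
# just two copy instructions and seven literal bytes.
delta = [("copy", 0, 9), ("insert", b"line 2\n"), ("copy", 18, 11)]
target = apply_delta(base, delta)
print(target)  # b'line one\nline 2\nline three\n'
```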

The delta compression is content-aware, not version-aware. Git chooses delta bases based on similarity, not history. A file might use a completely unrelated file as its delta base if they happen to share significant content. This approach often finds better compression than history-based deltas would.

The index file (.idx) enables random access. Without it, finding an object in a packfile would require linear scanning. The index maps SHA-1 hashes to byte offsets within the pack, enabling O(log n) lookup via binary search. This is why Git remains fast even with millions of objects—lookups don’t degrade to scanning.
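
The lookup structure can be sketched with Python's bisect module: keep the hashes sorted, binary-search for the target hash, and read the offset at the same position. (The real .idx layout is more elaborate, adding a 256-entry fan-out table keyed on the first hash byte, but the core idea is the same; the hashes and offsets below are made up.)

idx-lookup-sketch.py
```python
import bisect

# Sorted parallel arrays, as in a pack index: object ids and byte offsets.
shas    = ["1a" * 20, "7f" * 20, "c3" * 20]   # made-up ids, lexicographically sorted
offsets = [12, 4096, 88210]

def lookup(sha):
    """O(log n) search for an object's byte offset within the pack."""
    i = bisect.bisect_left(shas, sha)
    if i < len(shas) and shas[i] == sha:
        return offsets[i]
    return None  # object not in this pack

print(lookup("7f" * 20))  # 4096
print(lookup("ab" * 20))  # None
```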

pack-operations.sh
# Force a repack of loose objects
git repack -d
# Aggressive repack (finds better deltas, slower)
git repack -a -d -f --depth=250 --window=250
# Unpack a pack file back to loose objects (rarely needed)
git unpack-objects < .git/objects/pack/pack-*.pack

When should you think about packing? Large repositories benefit from periodic repacking, especially after importing history from other systems. The --aggressive flag to git gc triggers more expensive delta computation that sometimes finds better compression. Servers that handle many clones benefit from well-packed repositories—Git transfers packs directly during clone operations.

Clone operations transfer packfiles directly—Git generates a pack containing all requested objects and sends it over the network. This is why cloning a large repository downloads a single large file rather than millions of small ones. The pack is generated on-demand based on what the client needs.

For most repositories, Git’s automatic packing during git gc handles everything. You’d manually intervene only when repository size becomes problematic, when preparing a repository for archival, when optimizing a server repository handling many clones, or when importing from other version control systems.

The object database’s elegance lies in this separation: the logical model (content-addressed blobs, trees, commits) remains constant whether objects are loose or packed. Higher-level operations never need to know which storage format underlies a particular object. This abstraction lets Git optimize storage without changing semantics.

Key Takeaways

  • Use git cat-file -p <sha> to inspect any object and debug repository issues at the storage level. This single command demystifies most Git internals questions.

  • Implement a simple blob writer in Python to internalize the object format—the exercise cements your understanding and enables custom tooling for migration, testing, and automation scenarios.

  • Master reflog and fsck as your first tools when something goes wrong—they reveal what porcelain commands hide. The reflog is your safety net; fsck is your diagnostic tool.

  • Run git gc with understanding: know what’s safe to delete, what dangling objects you might need, and how the two-week grace period protects recent work.

  • Remember that branches are just files containing SHA-1 hashes—this mental model makes branch operations, merges, and rebases conceptually clear.

  • Distinguish between content (objects) and references (refs)—objects are immutable and content-addressed; refs are mutable pointers that give names to commits.

Armed with this knowledge, you’ll approach Git not as a mysterious tool but as a well-designed system with understandable internals. When git fsck reports dangling blobs or git gc behaves unexpectedly, you’ll know exactly where to look and what to do. The next time a colleague loses work, you can be the one who recovers it—not through magic, but through understanding.