# Object Database (ODB)

The Object Database (ODB) is the core storage engine for MediaGit, managing content-addressable objects with SHA-256 hashing.
## Architecture

```mermaid
graph TB
    API[ODB API] --> Cache[In-Memory Cache]
    Cache --> Compression[Compression Layer]
    Compression --> Backend[Storage Backend]
    API --> |Write| Hash[SHA-256 Hasher]
    Hash --> Cache
    Backend --> Local[Local FS]
    Backend --> S3[Amazon S3]
    Backend --> Azure[Azure Blob]
    Backend --> Cloud[Other Cloud Providers]

    style API fill:#e1f5ff
    style Cache fill:#fff4e1
    style Compression fill:#e8f5e9
```
## Core Operations

### Writing Objects

```rust
// 1. Calculate the SHA-256 hash of the content
let oid = sha256(&content);

// 2. Check the cache: with content addressing, an existing OID
//    means the object is already stored
if cache.contains(&oid) {
    return Ok(oid);
}

// 3. Compress the content
let compressed = compress(&content, CompressionAlgorithm::Zstd);

// 4. Write to the storage backend
backend.put(&oid.to_path(), &compressed).await?;

// 5. Update the cache and return the new OID
cache.insert(oid, content);
Ok(oid)
```
### Reading Objects

```rust
// 1. Check the cache
if let Some(content) = cache.get(&oid) {
    return Ok(content);
}

// 2. Read from the storage backend
let compressed = backend.get(&oid.to_path()).await?;

// 3. Decompress
let content = decompress(&compressed)?;

// 4. Verify integrity against the requested OID
let actual_oid = sha256(&content);
if actual_oid != oid {
    return Err(CorruptedObject);
}

// 5. Update the cache and return the content
cache.insert(oid, content.clone());
Ok(content)
```
## Object Types

### Blob Objects

- Purpose: Store raw file content
- Format: Pure bytes (no metadata)
- Example: PSD file → compressed blob
- Size: Unlimited (automatic chunking for large files)

Large File Handling: For files exceeding type-specific thresholds (5-10 MB), MediaGit automatically chunks the content:

- Chunk Size: 1-8 MB, adaptive based on file size (via `get_chunk_params()`)
- Strategy: Content-defined chunking (FastCDC v2020) for deduplication
- Overhead: Minimal (the chunk index is ~0.1% of file size)
- Benefit: Parallel processing and efficient delta compression
### Tree Objects

- Purpose: Represent directories
- Format:

  ```
  <mode> <type> <oid> <name>
  100644 blob a3c5d... README.md
  100644 blob f7e2a... large.psd
  040000 tree b8f3c... assets/
  ```

- Sorted: Entries are sorted by name for consistent hashing
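The sorting rule is what makes tree hashes reproducible: the same directory must always serialize to the same bytes. A minimal sketch of the canonical ordering, using a hypothetical `sort_tree_entries` helper (the real entry type carries mode and OID as well):

```rust
/// Hypothetical helper: order (name, oid) entries by name, bytewise,
/// so the serialized tree, and therefore its hash, is deterministic.
fn sort_tree_entries<'a>(
    mut entries: Vec<(&'a str, &'a str)>,
) -> Vec<(&'a str, &'a str)> {
    entries.sort_by(|a, b| a.0.cmp(b.0));
    entries
}
```

Note that bytewise ordering sorts uppercase names before lowercase ones, so `README.md` precedes `assets`.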
### Commit Objects

- Purpose: Snapshot with metadata
- Format:

  ```
  tree <tree-oid>
  parent <parent-oid>
  author <name> <email> <timestamp>
  committer <name> <email> <timestamp>

  <commit message>
  ```
### Tag Objects

- Purpose: Annotated tags with metadata
- Format:

  ```
  object <commit-oid>
  type commit
  tag v1.0.0
  tagger <name> <email> <timestamp>

  <tag message>
  ```
## Object Addressing

### OID (Object ID)

- Hash: SHA-256 (64 hex characters)
- Example: `5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03`

### Path Mapping

Objects are stored under a 2-character prefix for directory sharding:

```
OID:  5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03
Path: objects/58/91b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03
```

Benefit: Prevents a single directory from accumulating millions of files (a filesystem optimization).
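The prefix split itself is a few lines. A sketch with a hypothetical `oid_to_path` helper (the real OID is presumably a typed value, not a bare string):

```rust
/// Hypothetical helper: map a 64-character hex OID to its sharded
/// storage path. Assumes the input is a valid hex digest.
fn oid_to_path(oid: &str) -> String {
    let (prefix, rest) = oid.split_at(2);
    format!("objects/{}/{}", prefix, rest)
}
```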
## Caching Strategy

### Memory Cache

- Implementation: LRU cache (Least Recently Used)
- Default Size: 100 MB
- Eviction: Least recently used objects are evicted first
- Hit Rate Target: >80% for typical workflows
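The eviction behavior can be sketched as a byte-budgeted LRU. This is a simplified stand-in, not MediaGit's cache: the production version presumably adds sharding and hit statistics, and the `retain` on every hit is O(n), acceptable in a sketch but not in a real hot path.

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal byte-budgeted LRU sketch keyed by OID strings.
struct LruCache {
    max_bytes: usize,
    used_bytes: usize,
    entries: HashMap<String, Vec<u8>>,
    order: VecDeque<String>, // front = least recently used
}

impl LruCache {
    fn new(max_bytes: usize) -> Self {
        LruCache {
            max_bytes,
            used_bytes: 0,
            entries: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    fn get(&mut self, oid: &str) -> Option<&Vec<u8>> {
        if self.entries.contains_key(oid) {
            // Hit: move the key to the back (most recently used).
            self.order.retain(|k| k != oid);
            self.order.push_back(oid.to_string());
        }
        self.entries.get(oid)
    }

    fn insert(&mut self, oid: String, content: Vec<u8>) {
        self.used_bytes += content.len();
        if let Some(old) = self.entries.insert(oid.clone(), content) {
            self.used_bytes -= old.len();
            self.order.retain(|k| k != &oid);
        }
        self.order.push_back(oid);
        // Evict least recently used entries until under budget.
        while self.used_bytes > self.max_bytes {
            match self.order.pop_front() {
                Some(victim) => {
                    if let Some(v) = self.entries.remove(&victim) {
                        self.used_bytes -= v.len();
                    }
                }
                None => break,
            }
        }
    }
}
```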
### Cache Warming

Pre-load frequently accessed objects:

- HEAD commit and tree
- Recent commits (last 10)
- Currently checked-out files

### Cache Invalidation

- Object modification (rare, since objects are immutable)
- Explicit cache clear (`mediagit gc --clear-cache`)
- Repository verification failures
## Compression Integration

### Algorithm Selection

```rust
use std::path::Path;

const MB: u64 = 1024 * 1024;

fn select_compression(path: &Path, size: u64) -> CompressionAlgorithm {
    // extension() yields Option<&OsStr>; convert to &str for matching
    match path.extension().and_then(|ext| ext.to_str()) {
        // Already-compressed media: recompressing wastes CPU
        Some("mp4" | "mov" | "jpg" | "png") => CompressionAlgorithm::None,
        // Lossless audio compresses well with zstd
        Some("wav" | "flac" | "aiff") => CompressionAlgorithm::Zstd,
        // Text and code
        Some("txt" | "md" | "rs" | "py") => CompressionAlgorithm::Brotli,
        // Large binaries benefit from delta encoding
        Some("psd" | "blend" | "fbx") if size > 10 * MB => {
            CompressionAlgorithm::ZstdWithDelta
        }
        // Default
        _ => CompressionAlgorithm::Zstd,
    }
}
```
### Compression Levels

- Fast: zstd level 1 (150 MB/s compression)
- Default: zstd level 3 (100 MB/s, better ratio)
- Best: zstd level 19 (slow, maximum compression)
## Delta Encoding

### When to Use Deltas

- File size > 10 MB
- Multiple versions exist
- File type supports deltas (PSD, Blender, FBX)

### Delta Chain Management

```
Base Object (v1)   100 MB
    ↓ delta
Object v2         + 5 MB (delta)
    ↓ delta
Object v3         + 3 MB (delta)
    ↓ delta
Object v4         + 2 MB (delta)

Total: 110 MB (instead of 400 MB)
Chain depth: 3
```

### Chain Breaking

- Maximum depth: 10 (`MAX_DELTA_DEPTH`)
- Once the depth is exceeded, a new base object is created
- `mediagit gc` optimizes chains
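The bookkeeping above can be expressed as a small sketch. Only `MAX_DELTA_DEPTH` is named in the source; the `DeltaChain` type and its methods are illustrative:

```rust
const MAX_DELTA_DEPTH: usize = 10;

/// Hypothetical chain accounting: a full base object plus a list of
/// delta sizes (units are whatever the caller uses, MB in the example).
struct DeltaChain {
    base_size: u64,
    deltas: Vec<u64>, // newest delta last
}

impl DeltaChain {
    /// Bytes actually stored for this chain.
    fn stored_size(&self) -> u64 {
        self.base_size + self.deltas.iter().sum::<u64>()
    }

    fn depth(&self) -> usize {
        self.deltas.len()
    }

    /// True once the chain must be broken with a new base.
    fn needs_new_base(&self) -> bool {
        self.depth() >= MAX_DELTA_DEPTH
    }
}
```

With the example chain (100 MB base plus 5, 3, and 2 MB deltas), `stored_size()` is 110 MB at depth 3, well under the break threshold.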
## Integrity Verification

### Read-Time Verification

Every object read is verified:

```rust
let content = backend.get(&oid.to_path()).await?;
let actual_oid = sha256(&content);
if actual_oid != oid {
    return Err(OdbError::CorruptedObject {
        expected: oid,
        actual: actual_oid,
    });
}
```
### Bulk Verification

`mediagit verify` checks all objects:

- Read every object
- Verify its SHA-256 hash
- Report corrupted objects
- Optionally repair from remote

### Repair Operations

```bash
# Verify repository
mediagit verify

# Fetch missing/corrupted objects from remote
mediagit verify --fetch-missing

# Aggressive repair (expensive)
mediagit verify --repair --fetch-missing
```
## Performance Optimization

### Parallel Object Access

```rust
// Fetch multiple objects concurrently
let futures: Vec<_> = oids.iter()
    .map(|oid| odb.read(oid))
    .collect();
let objects = futures::future::join_all(futures).await;
```

### Batch Operations

```rust
// Write multiple objects in one backend call
odb.write_batch(&[
    (oid1, content1),
    (oid2, content2),
    (oid3, content3),
]).await?;
```

### Memory Management

- Stream large objects (chunking)
- Memory-mapped files for very large blobs
- Automatic cache eviction under memory pressure
## Large File Chunking

MediaGit implements intelligent chunking for efficient large-file storage and processing.

### Chunking Strategy

Content-Defined Chunking (CDC):

- Chunks are split at natural content boundaries
- Uses the FastCDC v2020 gear-table hash (O(1) work per byte)
- Average chunk size: 1-8 MB (adaptive by file size)
- Range: 512 KB - 32 MB

Benefits:

- Deduplication: Identical chunks are shared across files
- Parallel Processing: Chunks are processed concurrently
- Delta Efficiency: Small changes affect few chunks
- Memory Efficiency: Stream without loading the entire file
### Automatic Chunking Thresholds
| File Size | Chunking Strategy | Chunk Count |
|---|---|---|
| < 5-10 MB (type-dependent) | No chunking (single blob) | 1 |
| 5-100 MB | FastCDC (1 MB avg) | 5-100 |
| 100 MB - 10 GB | FastCDC (2 MB avg) | 50-5000 |
| > 10 GB | FastCDC (4-8 MB avg) | 2500+ |
### Chunking Configuration

```toml
[storage.chunking]
# Enable automatic chunking
enabled = true

# Chunk sizes are adaptive by file size (from get_chunk_params()):
#   < 100 MB:      avg 1 MB, min 512 KB, max 4 MB
#   100 MB-10 GB:  avg 2 MB, min 1 MB,   max 8 MB
#   10-100 GB:     avg 4 MB, min 1 MB,   max 16 MB
#   > 100 GB:      avg 8 MB, min 1 MB,   max 32 MB
```
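The adaptive sizing table can be sketched as follows; the real `get_chunk_params()` signature and return type are assumptions, only the cutoffs come from the configuration comments above:

```rust
const MB: u64 = 1024 * 1024;
const GB: u64 = 1024 * MB;

/// Sketch of adaptive chunk sizing: returns (min, avg, max) chunk
/// sizes in bytes for a given file size.
fn get_chunk_params(file_size: u64) -> (u64, u64, u64) {
    match file_size {
        s if s < 100 * MB => (512 * 1024, MB, 4 * MB),
        s if s < 10 * GB => (MB, 2 * MB, 8 * MB),
        s if s < 100 * GB => (MB, 4 * MB, 16 * MB),
        _ => (MB, 8 * MB, 32 * MB),
    }
}
```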
### Chunking Performance

Validated with a 6 GB test file:

- Chunks Created: 1,541 chunks
- Average Chunk Size: 4.12 MB (adaptive based on content)
- Processing: Parallel chunk compression and delta encoding
- Throughput: 35.5 MB/s (streaming mode)
- Memory Usage: < 256 MB (constant, regardless of file size)
### Chunk Storage

Chunks are stored as individual objects:

```
File: large-video.mp4 (6 GB)
  → Chunk 1: 58/91b5b522... (2 MB)
  → Chunk 2: a3/c5d7e2f... (2 MB)
  → Chunk 3: f7/e2a1b8c... (2 MB)
  ...
  → Chunk Index: 5a/2b3c4d... (metadata)
```

The Chunk Index contains:

- Chunk OIDs (SHA-256 hashes)
- Chunk offsets in the original file
- Chunk sizes
- Reconstruction order
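Those four pieces of metadata suggest a shape like the following; the field names are illustrative, not the actual on-disk format:

```rust
/// Illustrative chunk-index entry: where a chunk sits in the
/// reconstructed file and which object holds its bytes.
struct ChunkIndexEntry {
    oid: String, // SHA-256 of the chunk object
    offset: u64, // position in the original file
    size: u64,   // chunk length in bytes
}

struct ChunkIndex {
    entries: Vec<ChunkIndexEntry>, // kept in reconstruction order
}

impl ChunkIndex {
    /// Total size of the reconstructed file.
    fn file_size(&self) -> u64 {
        self.entries.iter().map(|e| e.size).sum()
    }

    /// Sanity check: chunks must tile the file contiguously from 0.
    fn is_contiguous(&self) -> bool {
        let mut expected = 0;
        for e in &self.entries {
            if e.offset != expected {
                return false;
            }
            expected += e.size;
        }
        true
    }
}
```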
### Deduplication Benefits

When the same content appears in multiple files:

```
File A (5 GB):
  → Chunks: [C1, C2, C3, C4, C5]

File B (5 GB with 60% overlap):
  → Chunks: [C1, C2, C3, C6, C7]
  → Only C6, C7 stored (2 GB)
  → C1-C3 reused (deduplication)

Storage: 7 GB instead of 10 GB (30% savings)
```
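The savings arithmetic follows directly from deduplicating on chunk OID: only the first occurrence of each OID costs storage. A minimal sketch (hypothetical helper, with each file given as a list of (chunk OID, size) pairs):

```rust
use std::collections::HashSet;

/// Bytes actually stored once chunks are deduplicated by OID.
fn stored_bytes(files: &[Vec<(&str, u64)>]) -> u64 {
    let mut seen = HashSet::new();
    let mut total = 0;
    for file in files {
        for (oid, size) in file {
            // insert() returns true only for OIDs not seen before
            if seen.insert(*oid) {
                total += *size;
            }
        }
    }
    total
}
```

With File A = [C1..C5] and File B = [C1, C2, C3, C6, C7] at 1 GB per chunk, this yields 7 GB stored instead of 10 GB.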
## Storage Backend Integration

### Backend Requirements

```rust
#[async_trait]
pub trait Backend {
    async fn get(&self, key: &str) -> Result<Vec<u8>>;
    async fn put(&self, key: &str, data: &[u8]) -> Result<()>;
    async fn exists(&self, key: &str) -> Result<bool>;
    async fn delete(&self, key: &str) -> Result<()>;
}
```
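For illustration, here is a synchronous in-memory analogue of the trait, the kind of stand-in a test double might use. The real trait is async and returns `Result`; this simplified version drops both to stay self-contained:

```rust
use std::collections::HashMap;

/// Simplified, synchronous analogue of the Backend trait.
trait SyncBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&mut self, key: &str, data: &[u8]);
    fn exists(&self, key: &str) -> bool;
    fn delete(&mut self, key: &str);
}

#[derive(Default)]
struct MemoryBackend {
    store: HashMap<String, Vec<u8>>,
}

impl SyncBackend for MemoryBackend {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.store.get(key).cloned()
    }
    fn put(&mut self, key: &str, data: &[u8]) {
        self.store.insert(key.to_string(), data.to_vec());
    }
    fn exists(&self, key: &str) -> bool {
        self.store.contains_key(key)
    }
    fn delete(&mut self, key: &str) {
        self.store.remove(key);
    }
}
```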
### Object Key Format

```
objects/{prefix}/{oid}
objects/58/91b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03
```

### Backend-Specific Optimizations

- S3: Multipart upload for large objects
- Azure: Parallel block upload
- Local: Direct file I/O, no network overhead
## Garbage Collection

### Reachability Analysis

```mermaid
graph LR
    Refs[refs/heads/main<br/>refs/tags/v1.0] --> C1[Commit 1]
    C1 --> T1[Tree 1]
    T1 --> B1[Blob 1]
    T1 --> B2[Blob 2]
    Orphan[Orphaned Blob]

    style Refs fill:#e1f5ff
    style Orphan fill:#ffcccc
```

### GC Process

1. Mark: Traverse the object graph from all refs
2. Sweep: Delete unmarked objects
3. Repack: Optimize delta chains
4. Verify: Check repository integrity
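The mark phase is a plain graph traversal from the refs. A sketch over a toy object graph, where string OIDs and an adjacency map stand in for real commit and tree parsing:

```rust
use std::collections::{HashMap, HashSet};

/// Mark phase sketch: walk from every ref and collect reachable OIDs.
/// Anything absent from the returned set is swept.
fn mark<'a>(
    refs: &[&'a str],
    children: &HashMap<&'a str, Vec<&'a str>>,
) -> HashSet<&'a str> {
    let mut reachable = HashSet::new();
    let mut stack: Vec<&'a str> = refs.to_vec();
    while let Some(oid) = stack.pop() {
        // insert() returns false if already visited, so cycles and
        // shared subtrees are traversed only once
        if reachable.insert(oid) {
            if let Some(kids) = children.get(oid) {
                stack.extend(kids.iter().copied());
            }
        }
    }
    reachable
}
```

In the diagram above, `commit1 → tree1 → {blob1, blob2}` are all marked, while the orphaned blob never enters the set and is eligible for sweeping.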
### Safety Mechanisms

- Grace period (14 days by default)
- Reflog preservation
- Dry-run mode
- Backup recommendations
## Error Handling

### Common Errors

- `ObjectNotFound`: OID not in the database
- `CorruptedObject`: Hash mismatch
- `BackendError`: Storage backend failure
- `CompressionError`: Decompression failure

### Recovery Strategies

- Retry with exponential backoff (transient errors)
- Fetch from remote (missing objects)
- Repair with verification (corruption)
- Fall back to uncompressed storage (compression errors)
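The first strategy can be sketched generically; this is a hypothetical helper, not MediaGit's actual retry API:

```rust
use std::time::Duration;

/// Retry a fallible operation, doubling the delay after each failure.
/// Returns the last error once max_attempts is exhausted.
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
    base_delay: Duration,
) -> Result<T, E> {
    let mut delay = base_delay;
    for attempt in 1..=max_attempts {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt == max_attempts => return Err(err),
            Err(_) => {
                std::thread::sleep(delay);
                delay *= 2; // exponential backoff
            }
        }
    }
    unreachable!("max_attempts must be >= 1")
}
```

Production code would typically also cap the delay and add jitter so concurrent clients do not retry in lockstep.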
## Monitoring Metrics

### Key Metrics

- Object count
- Total size (compressed vs. uncompressed)
- Cache hit rate
- Average object size
- Delta chain depth distribution

### Performance Metrics

- Read latency (p50, p99)
- Write latency (p50, p99)
- Compression ratio
- Backend throughput
## API Reference

See the API Documentation for detailed Rust API documentation.