# Large File Optimization

Strategies for handling very large files — video masters, high-resolution image sequences, 3D scene files, and game assets.
## How MediaGit Handles Large Files

MediaGit automatically adapts its behavior based on file size and type:
| File size / format | Chunker | Typical chunk count | Strategy |
|---|---|---|---|
| < 10 MB | FastCDC (small params) | 2–10 | Single-threaded |
| 10–100 MB | FastCDC (medium params) | 10–100 | Single-threaded |
| > 100 MB | StreamCDC | 100–2000 | Parallel workers |
| MP4 / MKV / WebM | Video container-aware | 1 per GOP | Per-format |
| PSD | Layer-aware | 1 per layer group | Per-format |
| WAV | FastCDC (audio-aware params) | Adaptive by size | Per-format |
For the formats in the last three rows, the format-aware chunker takes precedence over the size-based default. No configuration is required: MediaGit detects file size and type automatically. The size-based dispatch is sketched below.
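As a rough illustration of that dispatch, here is a minimal sketch in Rust using the open-source `fastcdc` crate, which provides both `FastCDC` and `StreamCDC`. The byte thresholds mirror the table above; the min/avg/max chunker parameters are assumptions for illustration, not MediaGit's actual values.

```rust
// Sketch of the size-based chunker dispatch (parameters are assumed).
use std::fs::File;
use std::io::Read;

use fastcdc::v2020::{FastCDC, StreamCDC};

const MB: u64 = 1024 * 1024;

fn chunk_file(path: &str) -> std::io::Result<()> {
    let size = std::fs::metadata(path)?.len();

    if size > 100 * MB {
        // > 100 MB: StreamCDC reads incrementally, so the file never has
        // to fit in memory. Assumed params: 1 MB min, 4 MB avg, 8 MB max.
        let source = File::open(path)?;
        for result in StreamCDC::new(source, 1 << 20, 4 << 20, 8 << 20) {
            let chunk = result.expect("chunking failed");
            // chunk.data would be handed to a compression worker here
            println!("offset={} len={}", chunk.offset, chunk.length);
        }
    } else {
        // <= 100 MB: chunk an in-memory buffer with FastCDC.
        let mut data = Vec::new();
        File::open(path)?.read_to_end(&mut data)?;
        let (min, avg, max) = if size < 10 * MB {
            (256 << 10, 1 << 20, 4 << 20) // "small" params (assumed)
        } else {
            (4 << 20, 8 << 20, 32 << 20) // "medium" params (assumed)
        };
        for chunk in FastCDC::new(&data, min, avg, max) {
            println!("offset={} len={}", chunk.offset, chunk.length);
        }
    }
    Ok(())
}

fn main() {
    chunk_file("assets/master.mov").expect("chunking failed");
}
```

The key design point is that `StreamCDC` reads incrementally, so a 100 GB file never has to fit in memory; that is what makes the parallel worker model below practical.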
## Parallel Ingestion

The single most effective optimization for large files is parallelism. MediaGit uses all CPU cores by default:

```bash
# Default: uses all cores
mediagit add assets/

# Explicit job count (match to your I/O bandwidth, not just CPU count)
mediagit add --jobs 8 assets/

# Disable for debugging or resource-constrained systems
mediagit add --no-parallel assets/
```
Expected throughput (validated benchmarks):
| File type | Sequential | Parallel (16 cores) |
|---|---|---|
| PSD (71 MB) | ~2 MB/s | ~35 MB/s |
| MP4 (500 MB) | ~3 MB/s | ~20 MB/s |
| Pre-compressed (JPEG) | ~80 MB/s | ~200 MB/s |
For very large files (10–100 GB), I/O tends to be the bottleneck rather than CPU. Use SSDs, and tune `--jobs` against your disk's sequential read throughput divided by the average chunk size, which bounds how many chunk reads per second the disk can feed.
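As arithmetic, under the assumption that this read bound is what limits the useful worker count (a sketch, not a MediaGit API; all numbers illustrative):

```rust
// The rule above, as arithmetic: sequential read rate divided by average
// chunk size bounds how many chunk reads per second the disk can feed,
// and therefore how many workers it can keep busy.
fn max_useful_jobs(disk_read_mb_per_s: u64, avg_chunk_mb: u64) -> u64 {
    (disk_read_mb_per_s / avg_chunk_mb).max(1)
}

fn main() {
    // Example: a 2,000 MB/s NVMe drive with 8 MB StreamCDC chunks can feed
    // ~250 chunk reads per second, far above the per-core compression rates
    // in the benchmark table, so the workload stays CPU-bound and --jobs
    // can simply match the core count.
    println!("{}", max_useful_jobs(2_000, 8)); // prints 250
}
```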
### Adding a 1 TB Media Collection

For a 1 TB media collection on a machine with 16 CPU cores and an SSD:

```bash
# Time estimate: 33–105 minutes depending on content
mediagit add --jobs 16 /media/collection/
```
Progress is shown per file and per chunk. The 33–105 minute estimate corresponds to an effective end-to-end throughput of roughly 160–500 MB/s, depending on how compressible the content is. The pipeline parallelizes at two levels (sketched after this list):

- File-level: multiple files processed concurrently (bounded semaphore)
- Chunk-level: each file's chunks compressed and stored in parallel (async-channel producer-consumer)
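Here is a minimal sketch of that two-level pipeline, assuming Tokio plus the `async-channel` crate named above. Task counts, channel bounds, and the helper functions are illustrative assumptions, not MediaGit internals.

```rust
// Illustrative two-level pipeline: a bounded semaphore caps concurrent
// files, and a bounded async-channel feeds each file's chunks to workers.
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let files = vec!["a.psd".to_string(), "b.mp4".to_string()];
    let file_slots = Arc::new(Semaphore::new(4)); // file-level bound

    let mut handles = Vec::new();
    for path in files {
        // Acquire a slot before spawning, so at most 4 files are in flight.
        let permit = file_slots.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            ingest_file(&path).await;
            drop(permit); // release the file slot
        }));
    }
    for h in handles {
        h.await.unwrap();
    }
}

async fn ingest_file(path: &str) {
    // Chunk-level producer-consumer: the bounded channel caps in-flight
    // uncompressed chunks, which is what bounds peak memory per file.
    let (tx, rx) = async_channel::bounded::<Vec<u8>>(4);

    let mut workers = Vec::new();
    for _ in 0..4 {
        let rx = rx.clone(); // async-channel receivers are MPMC-cloneable
        workers.push(tokio::spawn(async move {
            while let Ok(chunk) = rx.recv().await {
                compress_and_store(chunk).await;
            }
        }));
    }

    for chunk in chunk_stream(path) {
        tx.send(chunk).await.unwrap(); // back-pressures when workers lag
    }
    drop(tx); // close the channel so workers drain and exit

    for w in workers {
        w.await.unwrap();
    }
}

// Stand-ins for CDC chunking and Zstd compression + backend writes.
fn chunk_stream(_path: &str) -> Vec<Vec<u8>> {
    vec![vec![0u8; 1024]; 8]
}

async fn compress_and_store(_chunk: Vec<u8>) {}
```

Because the channel is bounded, a fast chunker cannot outrun slow storage: memory stays flat no matter how large the file is, which is the property the next section quantifies.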
## Memory Usage for Large Files

Each worker holds one uncompressed chunk in memory. Chunk sizes are approximately:

- FastCDC medium: 4–32 MB per chunk
- StreamCDC (> 100 MB files): 1–8 MB per chunk (adaptive by file size)

With `--jobs 16` and a 32 MB average chunk size, expect about 16 × 32 MB = 512 MB of peak worker memory during `mediagit add`.
Tune the object cache separately from worker memory:

```toml
[performance.cache]
max_size = 1073741824 # 1 GB — for repositories with many reads
```

Reduce the cache if your system has less than 8 GB of RAM:

```toml
[performance.cache]
max_size = 268435456 # 256 MB
```
## Cloud Backend Tips for Large Files

### S3 / MinIO

Increase connection pool and concurrency for large parallel uploads:

```toml
[performance]
max_concurrency = 32

[performance.connection_pool]
max_connections = 32

[performance.timeouts]
request = 300 # 5 minutes for very large chunks
write = 120
```

Use a bucket in the same region as your workstation or CI runner.
### Azure Blob

The Azure backend uses block upload for large objects. Increase the write timeout if uploads fail:

```toml
[performance.timeouts]
write = 120
```
### Local Filesystem

For local storage of very large repositories, `sync = true` ensures data safety on crash, at the cost of roughly 30% of write throughput:

```toml
[storage]
backend = "filesystem"
base_path = "./data"
sync = false # set true for critical data
```
## Delta Encoding for Large Files

MediaGit applies delta encoding when a new version of a file has chunks similar to the stored version. For large files, delta encoding can reduce storage from gigabytes to megabytes per revision:

```text
v1.psd: 500 MB (base)
v2.psd:  15 MB (delta — only changed layers stored)
v3.psd:   8 MB (delta — minor touch-up)
Total:  523 MB (vs 1,500 MB without delta)
```

Delta chains are capped at depth 10 to prevent slow reads: after 10 revisions, the next version is stored as a new base.
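In sketch form, the depth cap is a check at write time. The types below are hypothetical illustrations, not MediaGit's storage records; only the depth-10 rule comes from the text above.

```rust
// Hypothetical types illustrating the depth-cap rule: a delta records its
// distance from the base, and at the cap the next revision is stored as a
// fresh base so reads never replay more than 10 patches.
const MAX_DELTA_DEPTH: u32 = 10;

enum Stored {
    Base { bytes: Vec<u8> },
    Delta { against: String, depth: u32, patch: Vec<u8> },
}

fn store_revision(prev_id: &str, prev_depth: u32, full: Vec<u8>, patch: Vec<u8>) -> Stored {
    if prev_depth >= MAX_DELTA_DEPTH {
        // Chain is full: pay the storage cost once so reads stay fast.
        Stored::Base { bytes: full }
    } else {
        Stored::Delta {
            against: prev_id.to_string(),
            depth: prev_depth + 1,
            patch,
        }
    }
}

fn main() {
    // The 11th revision after a base (prev_depth == 10) becomes a new base.
    match store_revision("v11", 10, vec![0; 8], vec![0; 1]) {
        Stored::Base { .. } => println!("stored as new base"),
        Stored::Delta { depth, .. } => println!("delta at depth {depth}"),
    }
}
```

A read then costs at most one base plus ten patch applications, regardless of how many revisions a file has.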
Run `mediagit gc` periodically to optimize chain depth:

```bash
mediagit gc
```
## Garbage Collection

After deleting branches or files containing large objects, run GC to reclaim storage:

```bash
mediagit gc
```

For maximum reclamation (slower):

```bash
mediagit gc --aggressive
```
## Integrity Verification

After adding very large files, verify chunk integrity:

```bash
# Quick checksum check
mediagit fsck

# Full chunk-level verification (slower)
mediagit verify --path /path/to/large-file.mov
```
## File Format Recommendations

| File type | Notes |
|---|---|
| MP4 / MOV / MKV | Already compressed; stored as-is. Deduplication works at GOP level. |
| JPEG / PNG / WebP | Already compressed; stored as-is. No re-compression overhead. |
| PSD / PSB | Layer-aware chunking + Zstd compression. Excellent delta savings per revision. |
| TIFF (uncompressed) | Zstd compresses well. Large but effective delta encoding. |
| EXR | Typically compressed. Stored as-is. |
| WAV / AIFF | Audio-aware chunking. Zstd compresses ~40–60% on uncompressed audio. |
| PDF / AI / InDesign | PDF containers with internal compression; stored as-is. |
| ZIP / DOCX / XLSX | ZIP containers; stored as-is. |
| 3D (OBJ, FBX, GLB, STL) | Binary 3D data; compressed with Zstd at the Best level. |