Photos in. 77 MB compressed splat out.
One Modal job. Upload a folder of photos; receive a Scaffold-GS QAT-Bundle splat. No PLY required, no client-side training, no hand-holding through COLMAP. Measured on bonsai (292 photos, Mip-NeRF 360 sequence): 49.5 min end-to-end, $2.07 Modal spend.
Stage 1: COLMAP
Sparse 3D from photos
- Wall: 5.4 min
- Modal $: $0.05
Feature extraction (CPU SIFT) → exhaustive matcher → mapper. Outputs sparse/0/{cameras,images,points3D}.bin, which Scaffold-GS consumes via -s <workspace>. The bonsai run used 292 photos sharing a single SIMPLE_PINHOLE intrinsic.
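The three COLMAP steps above correspond roughly to the command sequence sketched below. This is a minimal illustration, not the pipeline's actual code: the `colmap_commands` helper and the workspace layout (`images/`, `database.db`, `sparse/`) are assumptions, and option names such as `--SiftExtraction.use_gpu` differ between COLMAP releases, so check `colmap feature_extractor --help` for your version.

```python
from pathlib import Path


def colmap_commands(workspace: str) -> list[list[str]]:
    """Build (but do not run) the three COLMAP CLI calls for one workspace.

    Assumed layout: <workspace>/images holds the photos; COLMAP writes
    <workspace>/database.db and <workspace>/sparse/0/*.bin.
    """
    ws = Path(workspace)
    db, images, sparse = ws / "database.db", ws / "images", ws / "sparse"
    return [
        # 1. SIFT feature extraction on CPU, one shared SIMPLE_PINHOLE camera.
        ["colmap", "feature_extractor",
         "--database_path", str(db),
         "--image_path", str(images),
         "--ImageReader.camera_model", "SIMPLE_PINHOLE",
         "--ImageReader.single_camera", "1",
         "--SiftExtraction.use_gpu", "0"],
        # 2. Match every image pair against every other.
        ["colmap", "exhaustive_matcher", "--database_path", str(db)],
        # 3. Incremental mapping; the sparse/ directory must already exist.
        ["colmap", "mapper",
         "--database_path", str(db),
         "--image_path", str(images),
         "--output_path", str(sparse)],
    ]


if __name__ == "__main__":
    for cmd in colmap_commands("/tmp/bonsai"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` so that a failed step raises instead of continuing with a partial reconstruction.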
Stage 2: Scaffold-GS
30k iterations · anchor representation
- Wall: 37.4 min
- Modal $: $1.56
Upstream Scaffold-GS train.py at voxel_size=0.001, ratio=1, n_offsets=10. Trained PSNR 32.83 dB (test eval @ 30k iter). 411k anchors. Output is a 130 MB anchor PLY + 3 MLPs (color, cov, opacity).
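For reference, the settings above map onto a training invocation along these lines. This is a hedged sketch: the `scaffold_gs_train_command` helper is hypothetical, and the flag spellings are assumed to mirror the upstream argument names (`voxel_size`, `ratio`, `n_offsets`, `iterations`), so verify them against `python train.py --help` in your checkout.

```python
def scaffold_gs_train_command(workspace: str, model_dir: str,
                              iterations: int = 30_000) -> list[str]:
    """Build (but do not run) a Scaffold-GS train.py command line.

    workspace: COLMAP workspace containing images/ and sparse/0/.
    model_dir: output directory for the anchor PLY and MLP checkpoints
               (the -m flag is assumed here, not stated in the text above).
    """
    return [
        "python", "train.py",
        "-s", workspace,
        "-m", model_dir,
        "--voxel_size", "0.001",
        "--ratio", "1",
        "--n_offsets", "10",
        "--iterations", str(iterations),
    ]


if __name__ == "__main__":
    print(" ".join(scaffold_gs_train_command("/tmp/bonsai", "/tmp/bonsai_out")))
```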
Stage 3: QAT-Bundle
Quant-aware finetune + encode
- Wall: 6.6 min
- Modal $: $0.46
5,000-iter QAT finetune at lr_init=2e-4, lr_final=1e-5, per-channel int4 quant on anchor + offset, constant-strip on opacity. Net PSNR delta: +0.521 dB (33.41 dB post-QAT vs 32.89 dB pre). PLY size reduction: 40.5% (129.9 MB → 77.3 MB).
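To make "per-channel int4 quant" concrete, here is a minimal pure-Python sketch of one common way to do it: symmetric absmax scaling with one scale per channel and codes clamped to [-8, 7]. The scaling rule and the function names are assumptions for illustration; the actual QAT-Bundle quantizer may differ. During quant-aware finetuning, the forward pass would see the dequantized values, so the network learns to tolerate the rounding error.

```python
def quantize_per_channel_int4(rows):
    """Quantize an [N, C] list of floats to signed 4-bit codes.

    Returns (codes, scales): integer codes in [-8, 7] and one float
    scale per channel, chosen so the channel's largest magnitude maps to 7.
    """
    n_channels = len(rows[0])
    scales = []
    for c in range(n_channels):
        max_abs = max(abs(r[c]) for r in rows)
        # Avoid a zero scale for all-zero channels.
        scales.append(max_abs / 7.0 if max_abs > 0 else 1.0)
    codes = [
        [max(-8, min(7, round(r[c] / scales[c]))) for c in range(n_channels)]
        for r in rows
    ]
    return codes, scales


def dequantize(codes, scales):
    """Map int4 codes back to floats using the per-channel scales."""
    return [[q * s for q, s in zip(row, scales)] for row in codes]


if __name__ == "__main__":
    rows = [[0.7, -14.0], [-0.3, 6.0]]
    codes, scales = quantize_per_channel_int4(rows)
    print(codes)
    print(dequantize(codes, scales))
```

Per-channel scales matter because anchor and offset channels can have very different ranges; a single global scale would spend most of the 16 levels on the widest channel.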
Stage 4: Upload
Vercel Blob → public URL
- Wall: 0.1 min
- Modal $: $0.00
The 77 MB output PLY is uploaded to Vercel Blob. The customer receives a public URL via an API callback, identical to the one used by the PLY-in / PLY-out flow.
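The per-stage figures above add up to the headline numbers. The short check below simply sums the wall times and Modal costs quoted for the four stages; the dictionary keys are illustrative labels, not pipeline identifiers.

```python
# (wall minutes, Modal USD) per stage, as quoted above for the bonsai run.
STAGES = {
    "colmap": (5.4, 0.05),
    "scaffold_gs": (37.4, 1.56),
    "qat_bundle": (6.6, 0.46),
    "upload": (0.1, 0.00),
}

total_min = round(sum(minutes for minutes, _ in STAGES.values()), 1)
total_usd = round(sum(usd for _, usd in STAGES.values()), 2)

if __name__ == "__main__":
    print(f"end-to-end: {total_min} min, ${total_usd:.2f}")
    for name, (minutes, usd) in STAGES.items():
        # Share of total wall time and total spend attributable to each stage.
        print(f"{name:12s} {minutes / total_min:6.1%} of time, "
              f"{usd / total_usd:6.1%} of cost")
```

Training dominates both axes, so reducing `training_iters` is the main lever on time and cost.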
The output
QAT-Bundle PLY
- Anchors: 411,066
- Final PLY: 77.3 MB
- Baseline PLY: 129.9 MB
- Baseline PSNR: 32.89 dB
- Post-QAT PSNR: 33.41 dB
- Finetune iter: 5,000
Single-PLY Scaffold-GS artifact. Drop into any Scaffold-GS renderer (Inria reference, gsplat, splat-transform) or re-encode through SplatForge's web-mobile preset for a progressive-streaming bundle.
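The figures in the table above can be cross-checked with a few lines of arithmetic. The sketch assumes 1 MB = 10^6 bytes, and the bytes-per-anchor values are simple averages over the whole file (header and any non-anchor payload included), so treat them as rough indicators only.

```python
ANCHORS = 411_066
BASELINE_MB = 129.9
FINAL_MB = 77.3
BASELINE_PSNR = 32.89
POST_QAT_PSNR = 33.41

# Fraction of the baseline file size removed by QAT-Bundle encoding.
size_reduction_pct = (1 - FINAL_MB / BASELINE_MB) * 100

# Average storage per anchor before and after (assumes MB = 1e6 bytes).
baseline_bytes_per_anchor = BASELINE_MB * 1e6 / ANCHORS
final_bytes_per_anchor = FINAL_MB * 1e6 / ANCHORS

psnr_delta = POST_QAT_PSNR - BASELINE_PSNR

if __name__ == "__main__":
    print(f"size reduction: {size_reduction_pct:.1f}%")
    print(f"bytes/anchor: {baseline_bytes_per_anchor:.0f} -> "
          f"{final_bytes_per_anchor:.0f}")
    print(f"PSNR delta: {psnr_delta:+.2f} dB")
```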
Reproduce
# 1. Pack your photos
zip photos.zip my-photos/*.jpg

# 2. Upload + dispatch (via splatforge CLI)
splatforge capture submit \
  --photos photos.zip \
  --preset splatforge-qat-bundle \
  --out ./out.ply

# 3. Or POST direct to the Modal app
curl -X POST https://montabano1--splatforge-capture-enqueue.modal.run/ \
  -H 'content-type: application/json' \
  -d '{
    "job_id": "demo-001",
    "preset": "capture-and-compress",
    "blob_url": "<vercel-blob-url>",
    "filename": "photos.zip",
    "inner_preset": "splatforge-qat-bundle",
    "training_iters": 30000,
    "callback_url": "<your-webhook>"
  }'
Job IDs are customer-supplied; the callback URL receives per-phase progress (fetching, colmap, training, encoding, uploading) and a terminal {status, output_url, metrics} POST.
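On the receiving side, a webhook needs to tell progress updates apart from the terminal POST. The sketch below shows one way to do that. The exact payload shapes are not specified here, so it assumes progress bodies carry a `phase` field and a `job_id`, and that only the terminal body carries `status`; `handle_callback` is a hypothetical helper, not part of any SplatForge SDK.

```python
# Phase names in the order listed above.
PHASES = ("fetching", "colmap", "training", "encoding", "uploading")


def handle_callback(payload: dict) -> str:
    """Summarize one decoded callback body as a log line.

    Assumed shapes (not confirmed by the service):
      progress: {"job_id": ..., "phase": <one of PHASES>}
      terminal: {"job_id": ..., "status": ..., "output_url": ..., "metrics": ...}
    """
    job = payload.get("job_id", "<unknown>")
    if "status" in payload:
        # Terminal message: the job is done, successfully or not.
        url = payload.get("output_url")
        if url:
            return f"{job}: finished ({payload['status']}), output at {url}"
        return f"{job}: finished ({payload['status']}), no output"
    phase = payload.get("phase")
    if phase in PHASES:
        step = PHASES.index(phase) + 1
        return f"{job}: phase {step}/{len(PHASES)} ({phase})"
    return f"{job}: unrecognised callback"


if __name__ == "__main__":
    print(handle_callback({"job_id": "demo-001", "phase": "training"}))
    print(handle_callback({"job_id": "demo-001", "status": "ok",
                           "output_url": "<public-url>", "metrics": {}}))
```

Because job IDs are customer-supplied, keying any stored state on `job_id` lets one endpoint serve many concurrent jobs.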
Honest input requirements
COLMAP is the bottleneck on capture quality. For a successful run we recommend:
- 50–500 photos in a single orbit. More photos make the exhaustive matcher's cost grow quadratically; fewer leave the reconstruction under-constrained.
- 60–80% overlap between adjacent frames. Walk slowly, don't sprint.
- Textured scene. A bare wall or a glass coffee table will fail at the mapper step; we surface this as a structured "too few features matched" error rather than a generic non-zero exit.
- Static scene. Anything moving (pets, foliage in wind, your hand) introduces ghost-anchor artifacts that Scaffold-GS can't recover from.
- Single camera. Mixed phones / DSLRs work but COLMAP's intrinsics are fit per-camera-model; one device per session is the happy path.
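The photo-count guideline follows from how exhaustive matching scales: every image is matched against every other, so n photos produce n(n-1)/2 pairs. The pre-flight sketch below computes that count and flags captures outside the recommended 50–500 range; the function names and messages are illustrative assumptions, not the service's actual validation.

```python
def match_pairs(n_photos: int) -> int:
    """Number of image pairs an exhaustive matcher has to consider."""
    return n_photos * (n_photos - 1) // 2


def check_photo_count(n_photos: int, lo: int = 50, hi: int = 500) -> str:
    """Return 'ok' or a short warning based on the recommended range."""
    if n_photos < lo:
        return f"too few photos ({n_photos} < {lo}): likely under-constrained"
    if n_photos > hi:
        return (f"too many photos ({n_photos} > {hi}): "
                f"{match_pairs(n_photos):,} pairs to match")
    return "ok"


if __name__ == "__main__":
    for n in (50, 292, 500, 1000):
        print(n, match_pairs(n), check_photo_count(n))
```

For the 292-photo bonsai set this is 42,486 pairs; at 500 photos it is 124,750, roughly three times the matching work for less than twice the photos.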
The Mip-NeRF 360 bonsai input (above) is a benchmark-grade sequence: a well-lit, textured indoor scene with a clean orbit. Real-world phone captures land within ±2 dB PSNR of these numbers, depending on how closely they match these conditions.
Run this on your own captures
The /capture endpoint is gated behind the Design Partner tier ($0/mo, capacity-capped) while we baseline real-world phone captures. Drop your email + one example zip and we'll wire it up.
Apply for the Design Partner tier →