Capture pipeline · end-to-end · measured

Photos in. 77 MB compressed splat out.

One Modal job. Upload a folder of photos; receive a Scaffold-GS QAT-Bundle splat. No PLY required, no client-side training, no hand-holding through COLMAP. Measured on bonsai (292 photos, Mip-NeRF 360 sequence): 49.5 min end-to-end, $2.07 Modal spend.

Source: Mip-NeRF 360 (bonsai) · 292 photos · baseline 32.89 dB → 33.41 dB (+0.52 dB) · 130 MB → 77 MB (40.5% smaller)

Stage 1
COLMAP
sparse 3D from photos

Wall: 5.4 min · Modal $: $0.05

Feature extraction (CPU SIFT) → exhaustive matcher → mapper. Outputs sparse/0/{cameras,images,points3D}.bin, which Scaffold-GS consumes via -s <workspace>. 292 photos, one shared SIMPLE_PINHOLE intrinsic.
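
The three COLMAP steps above can be sketched as plain command lists. This is an illustration, not the pipeline's actual invocation: the colmap_commands helper and the workspace layout (images/, database.db) are assumptions, and the SiftExtraction.use_gpu option name follows COLMAP 3.x releases and may differ in newer versions.

```python
from pathlib import Path


def colmap_commands(workspace: str) -> list[list[str]]:
    """Build the feature-extraction, matching, and mapping commands.

    Hypothetical helper: assumes photos live in <workspace>/images.
    """
    ws = Path(workspace)
    db = str(ws / "database.db")
    images = str(ws / "images")
    sparse = str(ws / "sparse")  # the mapper writes its first model to sparse/0
    return [
        # One shared SIMPLE_PINHOLE camera, SIFT on CPU.
        ["colmap", "feature_extractor",
         "--database_path", db,
         "--image_path", images,
         "--ImageReader.single_camera", "1",
         "--ImageReader.camera_model", "SIMPLE_PINHOLE",
         "--SiftExtraction.use_gpu", "0"],
        # Matches every image pair, so cost grows quadratically with photo count.
        ["colmap", "exhaustive_matcher", "--database_path", db],
        ["colmap", "mapper",
         "--database_path", db,
         "--image_path", images,
         "--output_path", sparse],
    ]
```

Each list can be passed to subprocess.run(cmd, check=True) in order, after creating the sparse directory, which the mapper expects to exist.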
Stage 2
Scaffold-GS
30k iterations · anchor representation

Wall: 37.4 min · Modal $: $1.56

Upstream Scaffold-GS train.py at voxel_size=0.001, ratio=1, n_offsets=10. Trained PSNR 32.83 dB (test eval @ 30k iter). 411k anchors. Output is a 130 MB anchor PLY + 3 MLPs (color, cov, opacity).
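
A quick size check shows where the 130 MB comes from. The sketch below assumes the upstream Scaffold-GS PLY layout (position, normal, offsets, anchor feature, opacity, scale, rotation, all float32) and the upstream default feat_dim=32. Neither is stated on this page, so treat the breakdown as an assumption that happens to reproduce the reported size for 411,066 anchors.

```python
# Per-anchor float counts, assuming the upstream Scaffold-GS PLY layout.
N_OFFSETS = 10   # stated above
FEAT_DIM = 32    # assumed upstream default, not stated on this page

FLOATS_PER_ANCHOR = {
    "position": 3,
    "normal": 3,
    "offsets": N_OFFSETS * 3,
    "anchor_feature": FEAT_DIM,
    "opacity": 1,
    "scale": 6,
    "rotation": 4,
}


def anchor_ply_bytes(num_anchors: int, bytes_per_float: int = 4) -> int:
    """Payload size of the anchor PLY, ignoring the small text header."""
    return num_anchors * sum(FLOATS_PER_ANCHOR.values()) * bytes_per_float


size_mb = anchor_ply_bytes(411_066) / 1e6
print(f"{size_mb:.1f} MB")  # prints "129.9 MB"
```

Under these assumptions each anchor costs 79 floats (316 bytes), and offsets plus anchor features account for 62 of them, which is why quantizing those fields dominates the savings.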
Stage 3
QAT-Bundle
quant-aware finetune + encode

Wall: 6.6 min · Modal $: $0.46

5,000-iteration QAT finetune at lr_init=2e-4, lr_final=1e-5, per-channel int4 quant on anchor + offset, constant-strip on opacity. Net PSNR delta: +0.521 dB (33.41 dB post-QAT vs 32.89 dB pre-QAT). PLY size reduction: 40.5%.
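
To make "per-channel int4" concrete, here is a minimal forward-pass sketch in plain Python. It is not the QAT-Bundle encoder: the symmetric [-8, 7] scheme, the max-abs scale, and the two-codes-per-byte packing are assumptions chosen for illustration. In quantization-aware training, the quantize/dequantize round trip is typically applied during the forward pass while gradients flow through unchanged (a straight-through estimator), which is omitted here.

```python
def quantize_int4_per_channel(rows):
    """Symmetric int4 quantization with one scale per channel (column).

    rows: list of equal-length float lists, one row per anchor.
    Returns (codes, scales) with every code in [-8, 7].
    """
    num_channels = len(rows[0])
    scales = []
    for c in range(num_channels):
        max_abs = max(abs(row[c]) for row in rows)
        # Map the largest magnitude in the channel to code 7.
        scales.append(max_abs / 7.0 if max_abs > 0 else 1.0)
    codes = [
        [max(-8, min(7, round(row[c] / scales[c]))) for c in range(num_channels)]
        for row in rows
    ]
    return codes, scales


def dequantize(codes, scales):
    """Recover approximate floats from int4 codes and per-channel scales."""
    return [[q * scales[c] for c, q in enumerate(row)] for row in codes]


def pack_nibbles(codes_row):
    """Pack int4 codes two per byte, low nibble first (pads odd lengths with 0)."""
    padded = list(codes_row) + [0] * (len(codes_row) % 2)
    return bytes(
        (padded[i] & 0xF) | ((padded[i + 1] & 0xF) << 4)
        for i in range(0, len(padded), 2)
    )
```

Each quantized value takes 4 bits instead of 32, and the reconstruction error per value is bounded by half the channel's scale.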
Stage 4
Upload
Vercel Blob → public URL

Wall: 0.1 min · Modal $: $0.00

77 MB output PLY uploaded to Vercel Blob. The customer receives a public URL via the same API callback used by the PLY-in / PLY-out flow.

Total wall 49.5 min

Total Modal spend $2.07

PSNR Δ vs Scaffold-GS-baseline +0.52 dB

PLY size reduction 40.5%
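
The totals follow directly from the per-stage numbers. A short check, using only figures reported on this page (the variable names are illustrative):

```python
# (wall minutes, Modal dollars) per stage, as reported above.
stages = {
    "colmap": (5.4, 0.05),
    "scaffold_gs": (37.4, 1.56),
    "qat_bundle": (6.6, 0.46),
    "upload": (0.1, 0.00),
}

total_wall_min = round(sum(wall for wall, _ in stages.values()), 1)
total_cost_usd = round(sum(cost for _, cost in stages.values()), 2)

# PLY sizes in MB and PSNR in dB, before and after QAT.
size_reduction_pct = round((129.9 - 77.3) / 129.9 * 100, 1)
psnr_delta_db = round(33.41 - 32.89, 2)

print(total_wall_min, total_cost_usd, size_reduction_pct, psnr_delta_db)
# prints: 49.5 2.07 40.5 0.52
```

Training accounts for roughly three quarters of both wall time and spend, so training_iters is the main lever on cost.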

The output

QAT-Bundle PLY

Anchors: 411,066
Final PLY: 77.3 MB
Baseline PLY: 129.9 MB
Baseline PSNR: 32.89 dB
Post-QAT PSNR: 33.41 dB
Finetune iter: 5,000

Download bonsai_qat_bundle.ply (77.3 MB)

Single-PLY Scaffold-GS artifact. Drop into any Scaffold-GS renderer (Inria reference, gsplat, splat-transform) or re-encode through Catetus's web-mobile preset for a progressive-streaming bundle.

Reproduce

# 1. Pack your photos
zip photos.zip my-photos/*.jpg

# 2. Upload + dispatch (via catetus CLI)
catetus capture submit \
  --photos photos.zip \
  --preset catetus-qat-bundle \
  --out ./out.ply

# 3. Or POST direct to the hosted endpoint
curl -X POST https://api.catetus.com/v1/capture/enqueue \
  -H 'content-type: application/json' \
  -d '{
    "job_id": "demo-001",
    "preset": "capture-and-compress",
    "blob_url": "<vercel-blob-url>",
    "filename": "photos.zip",
    "inner_preset": "catetus-qat-bundle",
    "training_iters": 30000,
    "callback_url": "<your-webhook>"
  }'

Job IDs are customer-supplied; the callback URL receives per-phase progress (fetching, colmap, training, encoding, uploading) and a terminal {status, output_url, metrics} POST.
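
On the receiving side, a webhook needs to tell progress events apart from the terminal POST. A minimal sketch of that dispatch follows. The terminal keys (status, output_url, metrics) and the phase names come from the description above. The "phase" key on progress events and the handle_callback helper are assumptions, since the progress payload shape is not documented here.

```python
PHASES = ("fetching", "colmap", "training", "encoding", "uploading")


def handle_callback(payload: dict) -> str:
    """Classify one callback POST body and return a short log line.

    Assumption: progress events carry a "phase" key. Only the terminal
    keys (status, output_url, metrics) are named in the text above.
    """
    if "output_url" in payload or "metrics" in payload:
        status = payload.get("status", "unknown")
        return f"done status={status} url={payload.get('output_url')}"
    phase = payload.get("phase")
    if phase in PHASES:
        return f"progress {PHASES.index(phase) + 1}/{len(PHASES)} {phase}"
    return "ignored: unrecognized payload"
```

For example, handle_callback({"phase": "training"}) returns "progress 3/5 training". Keying on the customer-supplied job_id is a natural next step if several jobs share one webhook.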

Honest input requirements

COLMAP is the bottleneck on capture quality. For a successful run we recommend:

50–500 photos in a single orbit (more, and the exhaustive matcher's cost grows quadratically; fewer, and the reconstruction is under-constrained).
60–80% overlap between adjacent frames. Walk slowly, don't sprint.
Textured scene. A bare wall or a glass coffee table gives COLMAP too few features, and the mapper step fails. We surface this as a structured "too few features matched" error rather than a generic non-zero exit.
Static scene. Anything moving (pets, foliage in wind, your hand) introduces ghost-anchor artifacts that Scaffold-GS can't recover from.
Single camera. Mixed phones / DSLRs work but COLMAP's intrinsics are fit per-camera-model; one device per session is the happy path.
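
The photo-count guidance can be checked before uploading. The sketch below uses the 50–500 range from the list above and counts the image pairs an exhaustive matcher has to consider, n(n-1)/2. The preflight helper is hypothetical and not part of the catetus CLI.

```python
def preflight(num_photos: int, min_photos: int = 50, max_photos: int = 500) -> dict:
    """Check the photo count and report the exhaustive-matching workload."""
    image_pairs = num_photos * (num_photos - 1) // 2  # every unordered pair
    return {
        "ok": min_photos <= num_photos <= max_photos,
        "image_pairs": image_pairs,
    }


print(preflight(292))  # {'ok': True, 'image_pairs': 42486}
print(preflight(500))  # {'ok': True, 'image_pairs': 124750}
```

Going from 292 to 500 photos nearly triples the number of pairs, which is the quadratic growth the first requirement refers to.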

The Mip-NeRF 360 bonsai input (above) is a benchmark-grade sequence: a well-lit, textured indoor scene with a clean orbit. Real-world phone captures land within ±2 dB of these PSNR numbers, depending on how closely they match these conditions.

Run this on your own captures

The /capture endpoint is gated behind the Design Partner tier ($0/mo, capacity-capped) while we baseline real-world phone captures. Drop your email + one example zip and we'll wire it up.

Apply for the Design Partner tier →