SplatBench v3 · methodology
How the continuous bench is run
The numbers on /bench/history and /bench are not hand-curated. Every codec preset is run against every scene in the corpus on every commit to public main. Below is the run model end-to-end, the determinism strategy, and the PR template for adding a preset or scene.
1. Run model
- On push to main, GitHub Actions (.github/workflows/splatbench-v3.yml) reads benches/scenes/manifest.json and generates a matrix of every (preset, scene) cell.
- Each matrix entry calls the preset's Modal endpoint via modal.Function.lookup(app, fn).remote(...), passing the scene's sourceUrl and a stable per-cell run id (sha256(commit|preset|scene)).
- The Modal endpoint downloads the PLY, runs the codec on an A100, uploads the compressed artifact, and returns {ply_save_pct, delta_psnr_db, delta_ssim, delta_lpips, wall_secs, modal_cost_cents, output_url}.
- Each matrix entry writes the cell JSON to benches/timeseries/<commit>/<preset>__<scene>.json.
- An aggregator job collates all cells into benches/timeseries/<commit>.json (the time-series row) and refreshes the continuousBench block of benches/reports/splatbench-v0.json (which powers the dashboard).
- The aggregator applies the ≤0.3 dB PSNR drop rule: if a cell's new ΔPSNR is more than 0.3 dB worse than the previously recorded value, the leaderboard keeps the old value and the regression is recorded in continuousBench.lastRejections. The per-commit time-series always records the new number; we don't silently hide regressions.
- The aggregator commits the time-series and report refresh back to main as splatbench-bot <bench@splatforge.dev>.
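The drop rule in the list above can be sketched as follows. This is a minimal illustration, not the code in benches/splatbench-v3-aggregate.mjs: the `Leaderboard` class, the `apply_drop_rule` function, and the cell key format are hypothetical, and the sketch assumes that a higher delta_psnr_db is better, so "worse" means numerically lower.

```python
from dataclasses import dataclass, field

DROP_THRESHOLD_DB = 0.3  # threshold taken from the 0.3 dB PSNR drop rule


@dataclass
class Leaderboard:
    # Hypothetical in-memory stand-in for the continuousBench block.
    best: dict = field(default_factory=dict)             # cell key -> accepted delta_psnr_db
    last_rejections: list = field(default_factory=list)  # mirrors continuousBench.lastRejections


def apply_drop_rule(board: Leaderboard, cell_key: str, new_delta_psnr_db: float) -> bool:
    """Return True if the leaderboard took the new value, False if it kept the old one.

    Assumption: higher delta_psnr_db is better, so a drop is previous - new.
    """
    previous = board.best.get(cell_key)
    if previous is not None and (previous - new_delta_psnr_db) > DROP_THRESHOLD_DB:
        # Keep the old leaderboard value and record the regression.
        board.last_rejections.append(
            {"cell": cell_key, "previous": previous, "rejected": new_delta_psnr_db}
        )
        return False
    board.best[cell_key] = new_delta_psnr_db
    return True


board = Leaderboard()
first = apply_drop_rule(board, "presetA__sceneB", -0.10)   # no prior value, accepted
second = apply_drop_rule(board, "presetA__sceneB", -0.45)  # 0.35 dB worse, rejected
```

The per-commit time-series write is deliberately absent from this sketch: per the rule above, that record always receives the new number regardless of what the leaderboard does.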
2. Why CI-driven
Anyone reading the bench can audit the workflow file, the dispatcher
(benches/splatbench-v3-dispatch.py), and the aggregator
(benches/splatbench-v3-aggregate.mjs). The Modal endpoints
they call live in separate apps, the same ones the worker forwards to,
so the bench exercises the same code path that customers hit at /try-it.
There is no special "bench build" of the codecs; you're measuring the
production encoder.
The previous bench (SplatBench v0) was hand-curated: a human ran the encoders locally, parsed the output, and pasted the results into a JSON file. v3 removes that step so credibility scales: adding a column doesn't cost a human afternoon, and there's no temptation to skip a scene that didn't look good.
3. Determinism & reproducibility
The matrix is idempotent on the same commit within a small tolerance:
- Scene hashes are pinned in manifest.json. The first successful workflow run computes the blake3 of the resolved PLY; subsequent runs verify that the hash matches before encoding.
- The per-cell run id is deterministic: sha256(commit | preset | scene)[:16]. Modal correlation IDs and on-disk artifacts use this so re-running on the same commit finds (and may reuse) prior work.
- Encoder seeds are pinned at the preset level: every trainer (Scaffold-GS, vanilla 3DGS, HAC++) uses seed=42 for the QAT retrain pass.
- Stochastic tolerance: re-runs on the same commit should reproduce numbers within ±0.01 dB. The aggregator's drop guard fires at 0.3 dB, so stochastic flapping never causes a false positive.
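As an illustration of the deterministic run id described above, here is a minimal sketch. The function names are hypothetical, and two details are assumptions rather than facts from the dispatcher: the three fields are joined with a bare "|" (no surrounding spaces) and UTF-8 encoded before hashing.

```python
import hashlib


def cell_run_id(commit: str, preset: str, scene: str) -> str:
    """Derive a stable 16-hex-character run id for one (commit, preset, scene) cell."""
    # Assumption: bare "|" separator and UTF-8 encoding; the real dispatcher may differ.
    payload = f"{commit}|{preset}|{scene}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def cell_path(commit: str, preset: str, scene: str) -> str:
    """Build the per-cell JSON path used in the run model."""
    return f"benches/timeseries/{commit}/{preset}__{scene}.json"


run_a = cell_run_id("abc123", "presetA", "sceneB")
run_b = cell_run_id("abc123", "presetA", "sceneB")  # same inputs, same id
run_c = cell_run_id("abc123", "presetA", "sceneC")  # different scene, different id
```

Because the id depends only on the commit, preset, and scene, a re-run on the same commit produces the same id and can locate earlier artifacts without any shared state.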
4. Adding a preset
- Deploy the encoder as a Modal app. The entrypoint must accept (preset, blob_url, filename, run_id) and return the cell envelope.
- Register the dispatch URL in apps/worker/worker.py:PRESET_DISPATCH_URLS and the preset value in benches/scenes/manifest.json:presets.
- Add the (Modal-app, function) tuple to benches/splatbench-v3-dispatch.py:PRESET_TO_MODAL.
- Open a PR; the regression-alert workflow will post a comment showing which cells the new preset will fill on merge.
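The entrypoint contract from the first step might look roughly like the sketch below. The field names come from the cell envelope listed in the run model; everything else is an assumption: the `CellEnvelope` and `encode_entrypoint` names, the field types, and the placeholder return values, which are not measured results. The Modal decorator and the actual download, encode, and upload steps are omitted so that the sketch runs without Modal installed.

```python
from typing import TypedDict


class CellEnvelope(TypedDict):
    # Field names from the run model; the types are assumed.
    ply_save_pct: float
    delta_psnr_db: float
    delta_ssim: float
    delta_lpips: float
    wall_secs: float
    modal_cost_cents: float
    output_url: str


def encode_entrypoint(preset: str, blob_url: str, filename: str, run_id: str) -> CellEnvelope:
    """Hypothetical entrypoint body: a real one would download blob_url,
    run the codec for `preset`, upload the result, and fill in real metrics."""
    return CellEnvelope(
        ply_save_pct=0.0,       # placeholder, not a measurement
        delta_psnr_db=0.0,      # placeholder
        delta_ssim=0.0,         # placeholder
        delta_lpips=0.0,        # placeholder
        wall_secs=0.0,          # placeholder
        modal_cost_cents=0.0,   # placeholder
        output_url=f"https://example.invalid/{run_id}/{filename}",  # hypothetical URL
    )


envelope = encode_entrypoint("presetA", "https://example.invalid/in.ply", "in.ply", "0123456789abcdef")
missing = set(CellEnvelope.__annotations__) - set(envelope)  # empty when the envelope is complete
```

A check like `missing` is one cheap way for a dispatcher to reject an incomplete envelope before it reaches the aggregator.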
5. Adding a scene
Append an entry to benches/scenes/manifest.json:
{
"id": "myscene_iter7k",
"splatCount": 0,
"bytesIn": 0,
"shDegree": 3,
"hash": "pending:first-bench-run",
"sourceUrl": "https://your-host/point_cloud.ply",
"license": "CC-BY-4.0 or compatible open license",
"class": "outdoor-scene",
"evalCameras": 24,
"dataset": "your-dataset",
"presetAllowList": null
}
The first successful workflow run pins the hash + splat count + byte size.
Scenes with restrictive licenses go through presetAllowList
to skip presets that re-upload the PLY publicly.
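To show how presetAllowList and the pending hash could affect the matrix, here is a minimal sketch. It assumes the manifest has a top-level "presets" list of preset names and a top-level "scenes" list of entries shaped like the one above (the "scenes" key name is a guess), and that a null presetAllowList means every preset may run. The function names are hypothetical.

```python
def expand_matrix(manifest: dict) -> list:
    """Return every (preset, scene id) cell the manifest permits."""
    cells = []
    for scene in manifest["scenes"]:
        # Assumption: None (JSON null) means no restriction.
        allow = scene.get("presetAllowList")
        for preset in manifest["presets"]:
            if allow is None or preset in allow:
                cells.append((preset, scene["id"]))
    return cells


def needs_pinning(scene: dict) -> bool:
    """True while the scene still carries the pending placeholder hash."""
    return scene["hash"].startswith("pending:")


manifest = {
    "presets": ["presetA", "presetB"],  # hypothetical preset names
    "scenes": [
        {"id": "myscene_iter7k", "hash": "pending:first-bench-run", "presetAllowList": None},
        {"id": "restricted_scene", "hash": "pending:first-bench-run", "presetAllowList": ["presetA"]},
    ],
}
cells = expand_matrix(manifest)
```

With this manifest the open scene produces two cells and the restricted scene produces one, so `cells` has three entries.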
6. Budget & safety
Each full matrix run is bounded by the Modal workspace concurrency cap (~10 parallel A100 slots)
and the workspace $50 cap. A typical full run (7 presets × 14 scenes = 98 cells)
with the current preset mix costs $30–$60 depending on which premium-tier QAT cells fire. The
workflow can be disabled at any time via gh workflow disable splatbench-v3.
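The figures above can be turned into a quick back-of-the-envelope calculation. The sketch below uses only the numbers quoted in this section; treating the concurrency cap as exactly 10 slots and assuming cells run in full waves are simplifications, and the variable names are illustrative.

```python
import math

presets, scenes = 7, 14
parallel_slots = 10             # "~10 parallel A100 slots", taken as exactly 10
cost_low, cost_high = 30.0, 60.0  # quoted cost range for a full run, in dollars
workspace_cap = 50.0            # quoted workspace cap, in dollars

cells = presets * scenes                    # 98 cells per full run
waves = math.ceil(cells / parallel_slots)   # 10 waves if every slot is always busy
per_cell_low = cost_low / cells             # about $0.31 per cell
per_cell_high = cost_high / cells           # about $0.61 per cell

# The upper end of the quoted range is above the quoted cap.
high_estimate_exceeds_cap = cost_high > workspace_cap
```

The average cost per cell works out to roughly $0.31 to $0.61, though actual per-cell cost will vary since the premium-tier QAT cells are the expensive ones.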