SplatBench v3 · methodology

How the continuous bench is run

The numbers on /bench/history and /bench are not hand-curated. Every codec preset is run against every scene in the corpus on every commit to public main. Below is the run model end-to-end, the determinism strategy, and the PR template for adding a preset or scene.

1. Run model

On push to main, GitHub Actions (.github/workflows/splatbench-v3.yml) reads benches/scenes/manifest.json and generates a matrix of every (preset, scene) cell.
Each matrix entry calls the preset's Modal endpoint via modal.Function.lookup(app, fn).remote(...), passing the scene's sourceUrl and a stable per-cell run id (sha256(commit|preset|scene)[:16], see section 3).
The Modal endpoint downloads the PLY, runs the codec on A100, uploads the compressed artifact, and returns {ply_save_pct, delta_psnr_db, delta_ssim, delta_lpips, wall_secs, modal_cost_cents, output_url}.
Each matrix entry writes the cell JSON to benches/timeseries/<commit>/<preset>__<scene>.json.
An aggregator job collates all cells into benches/timeseries/<commit>.json (time-series row) and refreshes the continuousBench block of benches/reports/splatbench-v0.json (powers the dashboard).
The aggregator applies the ≤0.3 dB PSNR drop rule: if a cell's new ΔPSNR is >0.3 dB worse than the previously-recorded value, the leaderboard keeps the old value, and the regression is recorded in continuousBench.lastRejections. The per-commit time-series always records the new number — we don't silently hide regressions.
The aggregator commits the time-series and the report refresh back to main as splatbench-bot <bench@catetus.com>.
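
The drop rule in the aggregator step can be sketched as follows. This is a minimal illustration in Python, not the code in benches/splatbench-v3-aggregate.mjs (which is a Node script). The Leaderboard class, the apply_drop_guard name, and the shape of the rejection records are hypothetical, and the sketch assumes delta_psnr_db is signed so that a lower value means worse quality.

```python
from dataclasses import dataclass, field

DROP_LIMIT_DB = 0.3  # threshold from the drop rule above


@dataclass
class Leaderboard:
    # (preset, scene) -> delta_psnr_db currently shown on the dashboard
    values: dict = field(default_factory=dict)
    # hypothetical stand-in for continuousBench.lastRejections
    last_rejections: list = field(default_factory=list)


def apply_drop_guard(board, preset, scene, new_delta_psnr_db):
    """Return the value the leaderboard should show for this cell."""
    key = (preset, scene)
    old = board.values.get(key)
    # Assumption: lower delta_psnr_db means worse, so a regression is old - new > 0.
    if old is not None and (old - new_delta_psnr_db) > DROP_LIMIT_DB:
        # Keep the old leaderboard value and record the rejection.
        board.last_rejections.append(
            {"preset": preset, "scene": scene, "kept": old, "rejected": new_delta_psnr_db}
        )
        return old
    board.values[key] = new_delta_psnr_db
    return new_delta_psnr_db
```

The guard only affects the leaderboard value; in this sketch the caller would still write new_delta_psnr_db to the per-commit time-series unchanged.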

2. Why CI-driven

Anyone reading the bench can audit the workflow file, the dispatcher (benches/splatbench-v3-dispatch.py), and the aggregator (benches/splatbench-v3-aggregate.mjs). The Modal endpoints they call live in separate apps, the same apps the worker forwards to, so the bench exercises the code path that customers hit at /try-it. There is no special "bench build" of the codecs; you're measuring the production encoder.

The previous bench (SplatBench v0) was hand-curated: a human ran the encoders locally, parsed the output, and pasted it into a JSON file. v3 removes that step so credibility scales: adding a column doesn't cost a human afternoon, and there's no temptation to skip a scene that didn't look good.

3. Determinism & reproducibility

Re-running the matrix on the same commit reproduces the same results within a small tolerance:

Scene hashes are pinned in manifest.json. The first successful workflow run computes the BLAKE3 hash of the resolved PLY; subsequent runs verify that the hash matches before encoding.
Per-cell run id is deterministic: sha256(commit | preset | scene)[:16]. Modal correlation IDs and on-disk artifacts use this so re-running on the same commit finds (and may reuse) prior work.
Encoder seeds are pinned at the preset level: every trainer (Scaffold-GS, vanilla 3DGS, HAC++) uses seed=42 for the QAT retrain pass.
Stochastic tolerance: re-runs on the same commit should reproduce numbers within ±0.01 dB. The aggregator's drop guard fires at 0.3 dB, 30× that tolerance, so stochastic flapping never causes a false positive.
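
The run id and hash-pinning rules above can be sketched as below. This is an illustration under stated assumptions, not the dispatcher's actual code: it assumes the three fields are joined with a bare "|", UTF-8 encoded, and that [:16] is taken from the hex digest. The function names are hypothetical. To stay standard-library only, check_or_pin takes an already-computed digest string rather than computing BLAKE3 itself.

```python
import hashlib


def cell_run_id(commit: str, preset: str, scene: str) -> str:
    """Deterministic per-cell run id: sha256(commit|preset|scene)[:16]."""
    # Assumption: bare "|" separator, UTF-8 bytes, truncation of the hex digest.
    payload = "|".join((commit, preset, scene)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def check_or_pin(manifest_hash: str, computed_hash: str) -> str:
    """Return the hash to store in the manifest, or raise on a mismatch."""
    # A "pending:" value (as in the scene template) means no run has pinned a hash yet.
    if manifest_hash.startswith("pending:"):
        return computed_hash
    if manifest_hash != computed_hash:
        raise ValueError("scene hash mismatch: refusing to encode")
    return manifest_hash
```

Because the id depends only on the commit, preset, and scene, a re-run of the same cell derives the same id and can look up prior artifacts by it.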

4. Adding a preset

Deploy the encoder as a Modal app. The entrypoint must accept (preset, blob_url, filename, run_id) and return the cell envelope.
Register the dispatch URL in apps/worker/worker.py:PRESET_DISPATCH_URLS and the preset value in benches/scenes/manifest.json:presets.
Add the (Modal-app, function) tuple to benches/splatbench-v3-dispatch.py:PRESET_TO_MODAL.
Open a PR; the regression-alert workflow will post a comment showing which cells the new preset will fill on merge.
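
Before opening the PR, it can help to check locally that the entrypoint returns a complete cell envelope. The sketch below lists the envelope keys from the run model in section 1 and reports any that are missing. The CellEnvelope type, the field types, and validate_envelope are hypothetical helpers for illustration; they are not part of the dispatcher or of Modal.

```python
from typing import TypedDict


class CellEnvelope(TypedDict):
    # Keys come from the run model; the value types are assumptions.
    ply_save_pct: float
    delta_psnr_db: float
    delta_ssim: float
    delta_lpips: float
    wall_secs: float
    modal_cost_cents: float
    output_url: str


REQUIRED_KEYS = frozenset(CellEnvelope.__annotations__)


def validate_envelope(cell: dict) -> list:
    """Return the sorted list of missing envelope keys (empty means complete)."""
    return sorted(REQUIRED_KEYS - set(cell))


def entrypoint(preset: str, blob_url: str, filename: str, run_id: str) -> CellEnvelope:
    """Signature the dispatcher expects; the body is left to the encoder."""
    raise NotImplementedError("download the PLY, run the codec, upload, return the envelope")
```

A typical use would be to call the real entrypoint on a small scene and assert that validate_envelope(result) is an empty list.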

5. Adding a scene

Append an entry to benches/scenes/manifest.json:

{
  "id": "myscene_iter7k",
  "splatCount": 0,
  "bytesIn": 0,
  "shDegree": 3,
  "hash": "pending:first-bench-run",
  "sourceUrl": "https://your-host/point_cloud.ply",
  "license": "CC-BY-4.0 or compatible open license",
  "class": "outdoor-scene",
  "evalCameras": 24,
  "dataset": "your-dataset",
  "presetAllowList": null
}

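The first successful workflow run pins the hash, splat count, and byte size (the hash, splatCount, and bytesIn fields). Scenes with restrictive licenses set presetAllowList to limit themselves to presets that do not re-upload the PLY publicly; null, as in the example, places no restriction.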
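
How presetAllowList could shape the matrix is sketched below. This is a minimal illustration, not the workflow's actual matrix generator: it assumes the manifest has a top-level "scenes" list alongside "presets", that presets are plain strings, and that a null allow list means every preset runs. The expand_matrix name is hypothetical.

```python
def expand_matrix(manifest: dict) -> list:
    """Expand a manifest into (preset, scene) cells, honoring presetAllowList."""
    cells = []
    for scene in manifest["scenes"]:  # assumption: top-level "scenes" key
        allow = scene.get("presetAllowList")
        for preset in manifest["presets"]:
            # None (JSON null) means no restriction; otherwise skip unlisted presets.
            if allow is not None and preset not in allow:
                continue
            cells.append({"preset": preset, "scene": scene["id"]})
    return cells
```

With two presets and one unrestricted scene this yields two cells; adding a second scene whose allow list names only one preset adds a single cell.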

6. Budget & safety

Each full matrix run is bounded by the Modal workspace concurrency cap (~10 parallel A100 slots) and the workspace $50 cap. A typical full run (7 presets × 14 scenes = 98 cells) with the current preset mix costs $30–$60 depending on which premium-tier QAT cells fire. The workflow can be disabled at any time via gh workflow disable splatbench-v3.
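
The figures above work out as follows. This is plain arithmetic on the numbers quoted in this section; how the cap is enforced mid-run is not modeled.

```python
PRESETS = 7
SCENES = 14
CAP_USD = 50             # workspace cap quoted above
RUN_COST_USD = (30, 60)  # typical full-run range quoted above

cells = PRESETS * SCENES
# Average cost per cell in cents at each end of the range.
per_cell_cents = tuple(round(100 * cost / cells, 1) for cost in RUN_COST_USD)
# Remaining budget under the cap; a negative value means the estimate exceeds it.
headroom_usd = tuple(CAP_USD - cost for cost in RUN_COST_USD)

print(cells, per_cell_cents, headroom_usd)
# prints: 98 (30.6, 61.2) (20, -10)
```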