SplatBench v3 · methodology

How the continuous bench is run

The numbers on /bench/history and /bench are not hand-curated. Every codec preset is run against every scene in the corpus on every commit to public main. Below are the end-to-end run model, the determinism strategy, and the PR template for adding a preset or scene.

1. Run model

  1. On push to main, GitHub Actions (.github/workflows/splatbench-v3.yml) reads benches/scenes/manifest.json and generates a matrix of every (preset, scene) cell.
  2. Each matrix entry calls the preset's Modal endpoint via modal.Function.lookup(app, fn).remote(...), passing the scene's sourceUrl and a stable per-cell run id (sha256(commit|preset|scene)).
  3. The Modal endpoint downloads the PLY, runs the codec on an A100, uploads the compressed artifact, and returns {ply_save_pct, delta_psnr_db, delta_ssim, delta_lpips, wall_secs, modal_cost_cents, output_url}.
  4. Each matrix entry writes the cell JSON to benches/timeseries/<commit>/<preset>__<scene>.json.
  5. An aggregator job collates all cells into benches/timeseries/<commit>.json (time-series row) and refreshes the continuousBench block of benches/reports/splatbench-v0.json (powers the dashboard).
  6. The aggregator applies the ≤0.3 dB PSNR drop rule: if a cell's new ΔPSNR is >0.3 dB worse than the previously-recorded value, the leaderboard keeps the old value, and the regression is recorded in continuousBench.lastRejections. The per-commit time-series always records the new number — we don't silently hide regressions.
  7. The aggregator commits the time-series and the report refresh back to main as splatbench-bot <bench@splatforge.dev>.
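
Steps 2, 4, and 6 can be sketched in a few lines of Python. This is a minimal illustration, not the aggregator's actual code (which lives in benches/splatbench-v3-aggregate.mjs). It assumes the run id joins its fields with a literal "|" before hashing, that a higher delta_psnr_db is better, and that a rejection record carries the field names shown; the function names are hypothetical.

```python
import hashlib

PSNR_DROP_LIMIT_DB = 0.3  # threshold from step 6


def cell_run_id(commit, preset, scene):
    """Stable per-cell run id: sha256 over 'commit|preset|scene'.

    Assumption: fields are joined with a literal '|' and UTF-8 encoded.
    """
    key = f"{commit}|{preset}|{scene}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def cell_path(commit, preset, scene):
    """Location of the per-cell JSON written in step 4."""
    return f"benches/timeseries/{commit}/{preset}__{scene}.json"


def apply_psnr_gate(previous_db, new_db):
    """Return (leaderboard_value, rejection) for one cell.

    Assumption: higher delta_psnr_db is better, so "worse" means lower.
    previous_db is None when the cell has never been recorded.
    """
    if previous_db is None:
        return new_db, None
    drop = previous_db - new_db
    if drop > PSNR_DROP_LIMIT_DB:
        # Leaderboard keeps the old value; the regression is reported.
        # The keys of this record are hypothetical.
        rejection = {"previous_db": previous_db, "new_db": new_db, "drop_db": drop}
        return previous_db, rejection
    return new_db, None
```

The per-commit time-series would store new_db in either case; only the leaderboard value is gated.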

2. Why CI-driven

Anyone reading the bench can audit the workflow file, the dispatcher (benches/splatbench-v3-dispatch.py), and the aggregator (benches/splatbench-v3-aggregate.mjs). The Modal endpoints they call live in separate apps, the same ones the worker forwards to, so the bench exercises the same code path that customers hit at /try-it. There is no special "bench build" of the codecs; you're measuring the production encoder.

The previous bench (SplatBench v0) was hand-curated: a human ran the encoders locally, parsed the output, and pasted the results into a JSON file. v3 removes that step so credibility scales: adding a column doesn't cost a human afternoon, and there's no temptation to skip a scene that didn't look good.

3. Determinism & reproducibility

The matrix is idempotent on the same commit within a small tolerance:

4. Adding a preset

  1. Deploy the encoder as a Modal app. The entrypoint must accept (preset, blob_url, filename, run_id) and return the cell envelope.
  2. Register the dispatch URL in apps/worker/worker.py:PRESET_DISPATCH_URLS and the preset value in benches/scenes/manifest.json:presets.
  3. Add the (Modal-app, function) tuple to benches/splatbench-v3-dispatch.py:PRESET_TO_MODAL.
  4. Open a PR; the regression-alert workflow will post a comment showing which cells the new preset will fill on merge.
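
Before opening the PR, it can help to check locally that the entrypoint's return value has the shape of the cell envelope. The sketch below is one way to do that. The field names come from the run model above; the expected types (numbers for the metrics, a string for output_url) and the helper name validate_cell_envelope are assumptions, not part of the bench tooling.

```python
from numbers import Real

# Field names are from the run model; the types are assumptions.
ENVELOPE_FIELDS = {
    "ply_save_pct": Real,
    "delta_psnr_db": Real,
    "delta_ssim": Real,
    "delta_lpips": Real,
    "wall_secs": Real,
    "modal_cost_cents": Real,
    "output_url": str,
}


def validate_cell_envelope(envelope):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for name, expected in ENVELOPE_FIELDS.items():
        if name not in envelope:
            problems.append(f"missing field: {name}")
            continue
        value = envelope[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems
```

A hypothetical call such as validate_cell_envelope(result) on the dictionary returned by a test run lists any missing or mistyped fields.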

5. Adding a scene

Append an entry to benches/scenes/manifest.json:

{
  "id": "myscene_iter7k",
  "splatCount": 0,
  "bytesIn": 0,
  "shDegree": 3,
  "hash": "pending:first-bench-run",
  "sourceUrl": "https://your-host/point_cloud.ply",
  "license": "CC-BY-4.0 or compatible open license",
  "class": "outdoor-scene",
  "evalCameras": 24,
  "dataset": "your-dataset",
  "presetAllowList": null
}

The first successful workflow run pins the hash, splat count, and byte size (the hash, splatCount, and bytesIn fields above). Scenes with restrictive licenses use presetAllowList to skip presets that re-upload the PLY publicly.
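
As an illustration of how presetAllowList might interact with the matrix, here is a minimal sketch of the expansion. It assumes the manifest holds a top-level "presets" list of preset names and a top-level "scenes" list of entries like the one above, and that a null allow list means every preset may run. The real dispatcher may be organized differently.

```python
def build_matrix(manifest):
    """Expand a manifest into (preset, scene_id) cells.

    Assumptions: manifest["presets"] is a list of preset names,
    manifest["scenes"] is a list of scene entries, and a scene whose
    presetAllowList is None (JSON null) accepts every preset.
    """
    cells = []
    for scene in manifest["scenes"]:
        allowed = scene.get("presetAllowList")
        for preset in manifest["presets"]:
            if allowed is None or preset in allowed:
                cells.append((preset, scene["id"]))
    return cells
```

With two presets and one scene restricted to a single preset, the restricted scene contributes one cell and an unrestricted scene contributes two.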

6. Budget & safety

Each full matrix run is bounded by the Modal workspace concurrency cap (~10 parallel A100 slots) and the workspace $50 cap. A typical full run (7 presets × 14 scenes = 98 cells) with the current preset mix costs $30–$60 depending on which premium-tier QAT cells fire. The workflow can be disabled at any time via gh workflow disable splatbench-v3.
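
The figures above imply some rough per-cell numbers. The sketch below works them out under two simplifying assumptions that are not stated in the source: each cell occupies exactly one A100 slot, and cells are scheduled in full waves of 10.

```python
import math

PRESETS = 7
SCENES = 14
PARALLEL_SLOTS = 10          # approximate concurrency cap quoted above
RUN_COST_USD = (30.0, 60.0)  # typical full-run cost range quoted above

cells = PRESETS * SCENES                    # 98 cells per full run
waves = math.ceil(cells / PARALLEL_SLOTS)   # 10 waves if one slot per cell

# Average cost per cell at the low and high end of the quoted range.
cost_per_cell_low = RUN_COST_USD[0] / cells
cost_per_cell_high = RUN_COST_USD[1] / cells

print(f"{cells} cells in {waves} waves")
print(f"average cost per cell: ${cost_per_cell_low:.2f} to ${cost_per_cell_high:.2f}")
```

These are averages only; per the text above, premium-tier QAT cells account for the spread, so individual cells can cost well above or below them.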