Benchmark Methodology
How RenderScope benchmarks are designed, conducted, and validated — ensuring fair, reproducible, and meaningful comparisons across rendering engines.
Philosophy
Rendering engine benchmarks are only useful if they are trustworthy. RenderScope's benchmark methodology is built on four core principles:
Fairness
Every renderer gets the same scene, the same resolution, the same hardware, and the same conditions. No renderer is given advantages through settings that favor its strengths.
Reproducibility
Every published benchmark includes full hardware specs, software versions, exact commands, and settings. Anyone can replicate the results.
Transparency
The methodology is public, the tools are open source, the raw data is downloadable. Nothing is hidden.
Honesty
Benchmarks have limitations, and we acknowledge them explicitly. See the Limitations section below.
Hardware Requirements
Benchmarks are grouped by hardware profile. Each benchmark result is tagged with the exact hardware it was run on. Results from different hardware profiles are never directly compared in the same chart — they are always labeled and grouped by profile.
Example Hardware Profile
CPU: AMD Ryzen 9 7950X (16 cores / 32 threads)
GPU: NVIDIA RTX 4090 (24 GB VRAM)
RAM: 64 GB DDR5-5600
OS: Ubuntu 22.04 LTS
Driver: NVIDIA 545.29.06, CUDA 12.3
Settings Standardization
All benchmarks use the following default parameters unless the renderer physically cannot operate with them:
| Parameter | Default Value | Rationale |
|---|---|---|
| Resolution | 1920 × 1080 | Industry standard HD; stresses renderers without being extreme |
| Sample Count (path tracers) | 1024 spp | High enough for convergence on most scenes; standard in research |
| Time Budget (real-time) | 60 s warmup + 10 s measured | Allows GPU caches to stabilize |
| Integrator | Renderer default path tracer | Fairest comparison — each renderer’s best general-purpose integrator |
| Denoiser | Disabled | Denoisers vary wildly; comparing raw output is more meaningful |
| Tone Mapping | Linear (no tone mapping) | Metrics must operate on linear data; sRGB is applied only for display |
| Thread Count | All available cores | Maximizes each renderer’s performance |
| GPU | Enabled when supported | Tests the renderer’s best available path |
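For illustration, these defaults could be collected into a single settings object that every renderer adapter receives. The sketch below is hypothetical; the field names are illustrative assumptions, not the actual RenderScope configuration schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of the standardized defaults above; the field names
# are illustrative, not RenderScope's actual configuration schema.
@dataclass(frozen=True)
class BenchmarkSettings:
    width: int = 1920            # 1920 x 1080 resolution
    height: int = 1080
    samples: int = 1024          # spp, path tracers only
    warmup_seconds: int = 60     # real-time renderers only
    measured_seconds: int = 10
    integrator: str = "default"  # each renderer's default path tracer
    denoiser: bool = False       # denoisers disabled
    tone_mapping: str = "linear" # metrics operate on linear data
    threads: int = 0             # 0 = all available cores
    use_gpu: bool = True         # used only when the renderer supports it

DEFAULT_SETTINGS = BenchmarkSettings()
```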
Fairness Protocol
Every benchmark run follows this protocol to minimize external variance and ensure comparable results:
Exclusive system access
No other computationally intensive processes may run during benchmarks. Close all browsers, IDEs, and background services.
Thermal equilibrium
Run a 5-minute warmup render (discarded) before timing begins. This ensures the CPU/GPU has reached its steady-state thermal performance, avoiding turbo-boost skew.
Multiple runs
Each benchmark is run a minimum of 3 times. The reported value is the median of all runs (robust to outliers), and the standard deviation is also recorded (see the sketch after this protocol).
Fresh process
Each run starts a fresh renderer process. No caching between runs.
Sequential execution
Renderers are benchmarked one at a time, never concurrently.
Version pinning
The exact version (commit hash or release tag) of each renderer is recorded and reported.
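The multiple-run and fresh-process rules above might look like the following sketch in practice. The command shown is a placeholder; each run launches a fresh renderer process, runs execute sequentially, and the median and standard deviation are computed over all runs.

```python
import statistics
import subprocess
import time

def run_once(cmd: list[str]) -> float:
    """Launch a fresh renderer process (no caching between runs)
    and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Placeholder command; the real invocation depends on the renderer.
cmd = ["pbrt", "cornell-box.pbrt"]

# Sequential execution, minimum of 3 runs per renderer.
times = [run_once(cmd) for _ in range(3)]
print(f"median = {statistics.median(times):.2f} s, "
      f"stdev = {statistics.stdev(times):.2f} s")
```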
Metric Definitions
PSNR (Peak Signal-to-Noise Ratio)
Measures pixel-level fidelity between a test image and a reference image. Higher is better.
Formula
PSNR = 10 · log₁₀(MAX² / MSE)
Where MAX is the maximum pixel value (1.0 for float images, 255 for 8-bit) and MSE is the Mean Squared Error.
Range: 0 to ∞ dB. Typical values: 20–25 dB (visible differences), 30–40 dB (good quality), 40+ dB (excellent).
Best for: Quick numerical comparison. Widely understood in the research community.
Limitation: Doesn't correlate perfectly with perceived visual quality — a small bright pixel shift can significantly reduce PSNR while being visually imperceptible.
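As a minimal sketch, the formula above can be evaluated with NumPy. It assumes linear float images in [0, 1] (so MAX = 1.0) and is illustrative rather than the exact RenderScope implementation.

```python
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, max_value: float = 1.0) -> float:
    """PSNR in dB between two linear float images (MAX = 1.0 by default)."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```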
SSIM (Structural Similarity Index)
Measures perceived structural similarity, accounting for luminance, contrast, and structure. Closer to 1.0 is better.
Formula
SSIM(x,y) = [(2μₓμᵧ + C₁)(2σₓᵧ + C₂)] / [(μₓ² + μᵧ² + C₁)(σₓ² + σᵧ² + C₂)]
Where μₓ, μᵧ are the means of the two images, σₓ², σᵧ² their variances, σₓᵧ their covariance, and C₁, C₂ are stabilization constants.
Range: −1 to 1 (typically 0 to 1). Values above 0.95 are generally considered excellent.
Best for: Perceptually aligned comparison. Better than PSNR at predicting what humans notice.
Implementation: Computed using scikit-image's structural_similarity with default parameters (window size 7, Gaussian weights).
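A short sketch of the scikit-image call described above. The data_range and channel_axis arguments are assumptions about how the images are stored (linear floats in [0, 1], RGB channels last); all other parameters are left at their defaults.

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim(reference: np.ndarray, test: np.ndarray) -> float:
    """SSIM between two HxWx3 linear float images in [0, 1]."""
    return structural_similarity(
        reference, test,
        data_range=1.0,   # float images in [0, 1]
        channel_axis=-1,  # RGB channels last
    )
```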
LPIPS (Learned Perceptual Image Patch Similarity)
Uses a deep neural network (VGG or AlexNet) to compare images in a perceptual feature space. Lower is better.
Range: 0 to ~1. Values below 0.1 indicate high similarity.
Best for: The most perceptually accurate metric available. Especially useful for comparing different rendering algorithms that produce structurally different but visually similar results.
Requirement: Requires PyTorch. Install via pip install renderscope[ml].
Implementation: Uses the torchmetrics LPIPS implementation with AlexNet backbone (default).
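A sketch of the torchmetrics usage referenced above, assuming images arrive as (N, 3, H, W) float tensors in [0, 1]; normalize=True tells the metric to rescale them to the range the backbone expects.

```python
import torch
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# AlexNet backbone, as noted above; normalize=True accepts inputs in [0, 1].
lpips_metric = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

def lpips_score(reference: torch.Tensor, test: torch.Tensor) -> float:
    """LPIPS between two (N, 3, H, W) float image batches in [0, 1]."""
    with torch.no_grad():
        return lpips_metric(test, reference).item()
```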
MSE (Mean Squared Error)
Average of squared pixel differences. Lower is better. The simplest image quality metric.
Formula
MSE = (1/N) · Σ (I_ref − I_test)²
Best for: Raw numerical comparison and intermediate computation (PSNR is derived from MSE).
Limitation: Highly sensitive to outlier pixels; not perceptually meaningful on its own.
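For completeness, a short NumPy sketch under the same assumptions as the PSNR example above (linear float images):

```python
import numpy as np

def mse(reference: np.ndarray, test: np.ndarray) -> float:
    """Mean squared error over all pixels and channels."""
    return float(np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2))
```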
Render Time
Wall-clock time from render start to completion, in seconds. Excludes scene loading and file I/O.
Measurement: Captured via time.perf_counter() wrapping the renderer's core render call.
Reported as: Median of 3+ runs, with standard deviation.
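An illustrative sketch of the timed region, assuming a hypothetical renderer object with a render() method; scene loading and file I/O happen outside the measured call.

```python
import time

def timed_render(renderer, scene) -> float:
    """Wall-clock time of the core render call only."""
    start = time.perf_counter()
    renderer.render(scene)  # hypothetical core render call
    return time.perf_counter() - start
```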
Peak Memory
Maximum resident memory (RSS) during the render process, in megabytes.
Measurement: Sampled every 100ms via psutil.Process().memory_info().rss.
Note: GPU memory is reported separately when available (via nvidia-smi or equivalent).
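A sketch of how the 100 ms RSS sampling could be done, assuming the renderer runs as a child process whose PID is known; this is illustrative rather than the exact RenderScope implementation.

```python
import time
import psutil

def peak_rss_mb(pid: int, poll_interval: float = 0.1) -> float:
    """Poll a process's resident set size every 100 ms until it exits,
    returning the peak value in megabytes."""
    process = psutil.Process(pid)
    peak = 0
    while process.is_running():
        try:
            peak = max(peak, process.memory_info().rss)
        except psutil.NoSuchProcess:
            break
        time.sleep(poll_interval)
    return peak / (1024 * 1024)
```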
Reproducibility
Every published benchmark can be reproduced by following these steps:
Identify the benchmark
Each benchmark result on the website includes a unique ID and links to this methodology page.
Install the CLI
pip install renderscope
Install the renderer(s)
Follow the renderer's own installation instructions, linked from its RenderScope profile page.
Download the scene
renderscope download-scenes --scene cornell-box
Run the benchmark with matching settings
renderscope benchmark --scene cornell-box --renderer pbrt --samples 1024 --width 1920 --height 1080
Compare results
The expected output includes render time, memory, and image metrics against the reference.
Limitations & Caveats
Acknowledging what the benchmarks do not measure is as important as explaining what they do. This transparency is essential for correct interpretation of results.
Scene coverage
The standard scenes (Cornell Box, Sponza, Stanford Bunny, Classroom, BMW, San Miguel, Veach MIS) test specific aspects of rendering but cannot cover every scenario. Results may not generalize to production workloads.
Feature asymmetry
Some renderers support features (volumetrics, subsurface scattering, spectral rendering) that others don’t. Benchmarks test common-denominator capabilities unless otherwise noted.
Configuration fairness
While we use default integrator settings, some renderers may have better non-default configurations. Expert tuning is out of scope.
Real-time vs. offline
A 60-second path trace and a 16 ms rasterized frame are fundamentally different workloads and are not directly comparable. We separate them in the benchmark dashboard and note each result's rendering paradigm.
Neural renderer specifics
Neural renderers (NeRF, 3DGS) require per-scene training, which is fundamentally different from classical rendering. Training time and rendering time are reported separately.