Benchmark Methodology
How RenderScope benchmarks are designed, conducted, and validated — ensuring fair, reproducible, and meaningful comparisons across rendering engines.
Philosophy
Rendering engine benchmarks are only useful if they are trustworthy. RenderScope's benchmark methodology is built on four core principles:
Fairness
Every renderer gets the same scene, the same resolution, the same hardware, and the same conditions. No renderer is given advantages through settings that favor its strengths.
Reproducibility
Every published benchmark includes full hardware specs, software versions, exact commands, and settings. Anyone can replicate the results.
Transparency
The methodology is public, the tools are open source, the raw data is downloadable. Nothing is hidden.
Honesty
Benchmarks have limitations, and we acknowledge them explicitly. See the Limitations section below.
Hardware Requirements
Benchmarks are grouped by hardware profile. Each benchmark result is tagged with the exact hardware it was run on. Results from different hardware profiles are never directly compared in the same chart — they are always labeled and grouped by profile.
Example Hardware Profile
CPU: AMD Ryzen 9 7950X (16 cores / 32 threads)
GPU: NVIDIA RTX 4090 (24 GB VRAM)
RAM: 64 GB DDR5-5600
OS: Ubuntu 22.04 LTS
Driver: NVIDIA 545.29.06, CUDA 12.3
Settings Standardization
All benchmarks use the following default parameters unless the renderer physically cannot operate with them:
| Parameter | Default Value | Rationale |
|---|---|---|
| Resolution | 1920 × 1080 | Industry standard HD; stresses renderers without being extreme |
| Sample Count (path tracers) | 1024 spp | High enough for convergence on most scenes; standard in research |
| Time Budget (real-time) | 60 s warmup + 10 s measured | Allows GPU caches to stabilize |
| Integrator | Renderer default path tracer | Fairest comparison — each renderer’s best general-purpose integrator |
| Denoiser | Disabled | Denoisers vary wildly; comparing raw output is more meaningful |
| Tone Mapping | Linear (no tone mapping) | Metrics must operate on linear data; sRGB is applied only for display |
| Thread Count | All available cores | Maximizes each renderer’s performance |
| GPU | Enabled when supported | Tests the renderer’s best available path |
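For illustration, these defaults could be collected into a single settings object that every renderer adapter receives. The sketch below is hypothetical; the field names are illustrative assumptions, not the actual RenderScope configuration schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of the standardized defaults above; the field names
# are illustrative, not RenderScope's actual configuration schema.
@dataclass(frozen=True)
class BenchmarkSettings:
    width: int = 1920            # 1920 x 1080 resolution
    height: int = 1080
    samples: int = 1024          # spp, path tracers only
    warmup_seconds: int = 60     # real-time renderers only
    measured_seconds: int = 10
    integrator: str = "default"  # each renderer's default path tracer
    denoiser: bool = False       # denoisers disabled
    tone_mapping: str = "linear" # metrics operate on linear data
    threads: int = 0             # 0 = all available cores
    use_gpu: bool = True         # used only when the renderer supports it

DEFAULT_SETTINGS = BenchmarkSettings()
```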
Fairness Protocol
Every benchmark run follows this protocol to minimize external variance and ensure comparable results:
Exclusive system access
No other computationally intensive processes may run during benchmarks. Close all browsers, IDEs, and background services.
Thermal equilibrium
Run a 5-minute warmup render (discarded) before timing begins. This ensures the CPU/GPU has reached its steady-state thermal performance, avoiding turbo-boost skew.
Multiple runs
Each benchmark is run a minimum of 3 times. The reported value is the median of all runs (robust to outliers), and the standard deviation is also recorded (see the sketch after this protocol).
Fresh process
Each run starts a fresh renderer process. No caching between runs.
Sequential execution
Renderers are benchmarked one at a time, never concurrently.
Version pinning
The exact version (commit hash or release tag) of each renderer is recorded and reported.
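The multiple-run and fresh-process rules above might look like the following sketch in practice. The command shown is a placeholder; each run launches a fresh renderer process, runs execute sequentially, and the median and standard deviation are computed over all runs.

```python
import statistics
import subprocess
import time

def run_once(cmd: list[str]) -> float:
    """Launch a fresh renderer process (no caching between runs)
    and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Placeholder command; the real invocation depends on the renderer.
cmd = ["pbrt", "cornell-box.pbrt"]

# Sequential execution, minimum of 3 runs per renderer.
times = [run_once(cmd) for _ in range(3)]
print(f"median = {statistics.median(times):.2f} s, "
      f"stdev = {statistics.stdev(times):.2f} s")
```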
Metric Definitions
PSNR (Peak Signal-to-Noise Ratio)
Measures pixel-level fidelity between a test image and a reference image. Higher is better.
Formula
PSNR = 10 · log₁₀(MAX² / MSE)
Where MAX is the maximum pixel value (1.0 for float images, 255 for 8-bit) and MSE is the Mean Squared Error.
Range: 0 to ∞ dB. Typical values: 20–25 dB (visible differences), 30–40 dB (good quality), 40+ dB (excellent).
Best for: Quick numerical comparison. Widely understood in the research community.
Limitation: Doesn't correlate perfectly with perceived visual quality — a small bright pixel shift can significantly reduce PSNR while being visually imperceptible.
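As a minimal sketch, the formula above can be evaluated with NumPy. It assumes linear float images in [0, 1] (so MAX = 1.0) and is illustrative rather than the exact RenderScope implementation.

```python
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, max_value: float = 1.0) -> float:
    """PSNR in dB between two linear float images (MAX = 1.0 by default)."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```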
SSIM (Structural Similarity Index)
Measures perceived structural similarity, accounting for luminance, contrast, and structure. Closer to 1.0 is better.
Formula
SSIM(x,y) = [(2μₓμᵧ + C₁)(2σₓᵧ + C₂)] / [(μₓ² + μᵧ² + C₁)(σₓ² + σᵧ² + C₂)]
Where μₓ, μᵧ are the means of the two images, σₓ², σᵧ² their variances, σₓᵧ their covariance, and C₁, C₂ are stabilization constants.
Range: −1 to 1 (typically 0 to 1). Values above 0.95 are generally considered excellent.
Best for: Perceptually aligned comparison. Better than PSNR at predicting what humans notice.
Implementation: Computed using scikit-image's structural_similarity with default parameters (window size 7, Gaussian weights).
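A short sketch of the scikit-image call described above. The data_range and channel_axis arguments are assumptions about how the images are stored (linear floats in [0, 1], RGB channels last); all other parameters are left at their defaults.

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim(reference: np.ndarray, test: np.ndarray) -> float:
    """SSIM between two HxWx3 linear float images in [0, 1]."""
    return structural_similarity(
        reference, test,
        data_range=1.0,   # float images in [0, 1]
        channel_axis=-1,  # RGB channels last
    )
```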
LPIPS (Learned Perceptual Image Patch Similarity)
Uses a deep neural network (VGG or AlexNet) to compare images in a perceptual feature space. Lower is better.
Range: 0 to ~1. Values below 0.1 indicate high similarity.
Best for: The most perceptually accurate metric available. Especially useful for comparing different rendering algorithms that produce structurally different but visually similar results.
Requirement: Requires PyTorch. Install via pip install renderscope[ml].
Implementation: Uses the torchmetrics LPIPS implementation with AlexNet backbone (default).
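A sketch of the torchmetrics usage referenced above, assuming images arrive as (N, 3, H, W) float tensors in [0, 1]; normalize=True tells the metric to rescale them to the range the backbone expects.

```python
import torch
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# AlexNet backbone, as noted above; normalize=True accepts inputs in [0, 1].
lpips_metric = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

def lpips_score(reference: torch.Tensor, test: torch.Tensor) -> float:
    """LPIPS between two (N, 3, H, W) float image batches in [0, 1]."""
    with torch.no_grad():
        return lpips_metric(test, reference).item()
```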
MSE (Mean Squared Error)
Average of squared pixel differences. Lower is better. The simplest image quality metric.
Formula
MSE = (1/N) · Σ (I_ref − I_test)²
Best for: Raw numerical comparison and intermediate computation (PSNR is derived from MSE).
Limitation: Highly sensitive to outlier pixels; not perceptually meaningful on its own.
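For completeness, a short NumPy sketch under the same assumptions as the PSNR example above (linear float images):

```python
import numpy as np

def mse(reference: np.ndarray, test: np.ndarray) -> float:
    """Mean squared error over all pixels and channels."""
    return float(np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2))
```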
Render Time
Wall-clock time from render start to completion, in seconds. Excludes scene loading and file I/O.
Measurement: Captured via time.perf_counter() wrapping the renderer's core render call.
Reported as: Median of 3+ runs, with standard deviation.
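An illustrative sketch of the timed region, assuming a hypothetical renderer object with a render() method; scene loading and file I/O happen outside the measured call.

```python
import time

def timed_render(renderer, scene) -> float:
    """Wall-clock time of the core render call only."""
    start = time.perf_counter()
    renderer.render(scene)  # hypothetical core render call
    return time.perf_counter() - start
```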
Peak Memory
Maximum resident memory (RSS) during the render process, in megabytes.
Measurement: Sampled every 100ms via psutil.Process().memory_info().rss.
Note: GPU memory is reported separately when available (via nvidia-smi or equivalent).
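A sketch of how the 100 ms RSS sampling could be done, assuming the renderer runs as a child process whose PID is known; this is illustrative rather than the exact RenderScope implementation.

```python
import time
import psutil

def peak_rss_mb(pid: int, poll_interval: float = 0.1) -> float:
    """Poll a process's resident set size every 100 ms until it exits,
    returning the peak value in megabytes."""
    process = psutil.Process(pid)
    peak = 0
    while process.is_running():
        try:
            peak = max(peak, process.memory_info().rss)
        except psutil.NoSuchProcess:
            break
        time.sleep(poll_interval)
    return peak / (1024 * 1024)
```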
Reproducibility
Every published benchmark can be reproduced by following these steps:
Identify the benchmark
Each benchmark result on the website includes a unique ID and links to this methodology page.
Install the CLI
pip install renderscope
Install the renderer(s)
Follow the renderer's own installation instructions, linked from its RenderScope profile page.
Download the scene
renderscope download-scenes --scene cornell-box
Run the benchmark with matching settings
renderscope benchmark --scene cornell-box --renderer pbrt --samples 1024 --width 1920 --height 1080
Compare results
The expected output includes render time, memory, and image metrics against the reference.
Limitations & Caveats
Acknowledging what the benchmarks do not measure is as important as explaining what they do. This transparency is essential for correct interpretation of results.
Scene coverage
The standard scenes (Cornell Box, Sponza, Stanford Bunny, Classroom, BMW, San Miguel, Veach MIS) test specific aspects of rendering but cannot cover every scenario. Results may not generalize to production workloads.
Feature asymmetry
Some renderers support features (volumetrics, subsurface scattering, spectral rendering) that others don’t. Benchmarks test common-denominator capabilities unless otherwise noted.
Configuration fairness
While we use default integrator settings, some renderers may have better non-default configurations. Expert tuning is out of scope.
Real-time vs. offline
A 60-second path trace and a 16 ms rasterized frame are fundamentally different workloads and are not directly comparable. We separate them in the benchmark dashboard and note each result's rendering paradigm.
Neural renderer specifics
Neural renderers (NeRF, 3DGS) require per-scene training, which is fundamentally different from classical rendering. Training time and rendering time are reported separately.