Benchmark Suite
Hello and welcome to bnch_swt, or "Benchmark Suite": a collection of classes and functions for benchmarking CPU and GPU performance.
The officially supported compilers and operating systems are listed in the Requirements section below.
Quickstart Guide for benchmarksuite
This guide will walk you through setting up and running benchmarks using benchmarksuite.
Table of Contents
- Installation
- Basic Example
- Creating Benchmarks
- CPU vs GPU Benchmarking
- Running Benchmarks
- Output and Results
- Features
- API Conventions
- Migrating from Pre-1.0.0
Installation
Method 1: vcpkg + CMake (Recommended)
Step 1: Add to vcpkg.json
Create or update your vcpkg.json in your project root:
{ "name": "your-project-name", "version": "1.0.0", "dependencies": [ "benchmarksuite" ] }
Step 2: Configure CMake
In your CMakeLists.txt:
```cmake
cmake_minimum_required(VERSION 3.20)
project(YourProject LANGUAGES CXX CUDA) # Add CUDA if using GPU benchmarks

# Set C++ standard
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# For CUDA support (optional)
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

# Find the package
find_package(benchmarksuite CONFIG REQUIRED)

# Create your executable
add_executable(your_benchmark main.cpp)

# Link against benchmarksuite (header-only, just sets up includes)
target_link_libraries(your_benchmark PRIVATE benchmarksuite::benchmarksuite)

# If using CUDA
set_target_properties(your_benchmark PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
```
Step 3: Configure with vcpkg toolchain
```bash
# Configure
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake

# Build
cmake --build build --config Release
```
Step 4: Include in your code
```cpp
#include <bnch_swt/index.hpp>

int main() {
    // Your benchmarks here
    return 0;
}
```
Method 2: Manual Installation
If not using vcpkg, you can include benchmarksuite as a header-only library:
Step 1: Clone the repository
```bash
git clone https://github.com/RealTimeChris/benchmarksuite.git
```
Step 2: Add to CMake
```cmake
# Add as subdirectory
add_subdirectory(path/to/benchmarksuite)

# Or set include directory
target_include_directories(your_target PRIVATE path/to/benchmarksuite/include)
```
Step 3: Include headers
```cpp
#include <bnch_swt/index.hpp>
```
Requirements
To use benchmarksuite, ensure you have a C++23 (or later) compliant compiler.
For CPU Benchmarking:
- MSVC 2022 or later
- GCC 13 or later
- Clang 16 or later
For GPU/CUDA Benchmarking:
- NVIDIA CUDA Toolkit 11.0 or later
- NVCC compiler
- CUDA-capable GPU
Platform-Specific Notes
Windows:
- Use Visual Studio 2022 or later
- For CUDA: Install CUDA Toolkit from NVIDIA
Linux:
- Install build essentials: `sudo apt-get install build-essential`
- For CUDA: Install CUDA Toolkit via package manager or NVIDIA installer
macOS:
- Install Xcode Command Line Tools
- CUDA support not available on Apple Silicon (M1/M2/M3)
Verification
Verify your installation with a simple test:
```cpp
#include <bnch_swt/index.hpp>
#include <iostream>

int main() {
    std::cout << "benchmarksuite successfully installed!" << std::endl;
    return 0;
}
```
Basic Example
The following example demonstrates how to set up and run a benchmark comparing two integer-to-string conversion functions:
```cpp
// Define benchmark functions as structs with static impl() methods
struct glz_to_chars_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values, std::vector<std::string>& test_values_00,
        std::vector<std::string>& test_values_01) {
        uint64_t bytes_processed = 0;
        char newer_string[30]{};
        for (uint64_t x = 0; x < test_values.size(); ++x) {
            std::memset(newer_string, '\0', sizeof(newer_string));
            auto new_ptr = glz::to_chars(newer_string, test_values[x]);
            bytes_processed += test_values_00[x].size();
            test_values_01[x] = std::string{ newer_string, static_cast<uint64_t>(new_ptr - newer_string) };
        }
        return bytes_processed;
    }
};

struct jsonifier_to_chars_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values, std::vector<std::string>& test_values_00,
        std::vector<std::string>& test_values_01) {
        uint64_t bytes_processed = 0;
        char newer_string[30]{};
        for (uint64_t x = 0; x < test_values.size(); ++x) {
            std::memset(newer_string, '\0', sizeof(newer_string));
            auto new_ptr = jsonifier_internal::to_chars(newer_string, test_values[x]);
            bytes_processed += test_values_00[x].size();
            test_values_01[x] = std::string{ newer_string, static_cast<uint64_t>(new_ptr - newer_string) };
        }
        return bytes_processed;
    }
};

int main() {
    constexpr uint64_t count = 512;

    // Setup test data
    std::vector<int64_t> test_values = generate_random_integers<int64_t>(count, 20);
    std::vector<std::string> test_values_00;
    std::vector<std::string> test_values_01(count);
    for (uint64_t x = 0; x < count; ++x) {
        test_values_00.emplace_back(std::to_string(test_values[x]));
    }

    // Define benchmark stage with 200 total iterations, 25 measured, CPU benchmarking
    using benchmark = bnch_swt::benchmark_stage<"int-to-string-comparison", 200, 25,
        bnch_swt::benchmark_types::cpu>;

    // Run benchmarks
    benchmark::run_benchmark<"glz::to_chars", glz_to_chars_benchmark>(test_values, test_values_00, test_values_01);
    benchmark::run_benchmark<"jsonifier::to_chars", jsonifier_to_chars_benchmark>(test_values, test_values_00, test_values_01);

    // Print results with comparison
    benchmark::print_results(true, true);
    return 0;
}
```
Creating Benchmarks
To create a benchmark:
- Define your benchmark functions as structs with a static `impl()` method that returns `uint64_t` (bytes processed)
- Use `bnch_swt::benchmark_stage` with appropriate template parameters
- Call `run_benchmark` with your benchmark struct and any required arguments (see the minimal sketch below)
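The minimal sketch below ties these three steps together. It uses only the API documented in this guide; the `copy_benchmark` struct, its buffers, and the stage name are illustrative placeholders rather than part of the library.

```cpp
#include <bnch_swt/index.hpp>
#include <cstring>
#include <vector>

// Step 1: a struct with a static impl() that returns bytes processed (illustrative example)
struct copy_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<char>& src, std::vector<char>& dst) {
        std::memcpy(dst.data(), src.data(), src.size());
        return src.size(); // bytes processed
    }
};

int main() {
    std::vector<char> src(1024 * 1024, 'a');
    std::vector<char> dst(src.size());

    // Step 2: a benchmark stage (defaults: 200 total iterations, 25 measured, CPU)
    using stage = bnch_swt::benchmark_stage<"copy-test">;

    // Step 3: run and report
    stage::run_benchmark<"memcpy", copy_benchmark>(src, dst);
    stage::print_results();
    return 0;
}
```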
Benchmark Stage
The benchmark_stage structure orchestrates each test and supports both CPU and GPU benchmarking:
```cpp
// Full template signature
template<bnch_swt::string_literal stage_name,                                        // Required: benchmark stage name
    uint64_t max_execution_count                = 200,                               // Total iterations (warmup + measured)
    uint64_t measured_iteration_count           = 25,                                // Iterations to measure
    bnch_swt::benchmark_types benchmark_type    = bnch_swt::benchmark_types::cpu,    // CPU or CUDA
    bool clear_cpu_cache_between_each_iteration = false,                             // Cache clearing flag
    bnch_swt::string_literal metric_name        = bnch_swt::string_literal<1>{}>     // Custom metric name
struct benchmark_stage;

// Common usage examples
using cpu_benchmark = bnch_swt::benchmark_stage<"my-benchmark">; // Uses defaults: 200 total, 25 measured, CPU
using gpu_benchmark = bnch_swt::benchmark_stage<"gpu-test", 100, 10, bnch_swt::benchmark_types::cuda>;
using custom_metric = bnch_swt::benchmark_stage<"compression", 200, 25, bnch_swt::benchmark_types::cpu, false, "compression-ratio">;
```
Template Parameters
- stage_name (required): String literal identifying the benchmark stage
- max_execution_count (default 200): Total number of iterations including warmup
- measured_iteration_count (default 25): Number of iterations to measure for final metrics
- benchmark_type (default cpu): `bnch_swt::benchmark_types::cpu` or `bnch_swt::benchmark_types::cuda`
- clear_cpu_cache_between_each_iteration (default false): Whether to clear CPU caches between iterations
- metric_name (default empty): Custom metric name for specialized benchmarks (e.g., compression ratios)
Methods
- `run_benchmark<name, function_type>(args...)`: Executes the benchmark function's `impl()` method with the provided arguments
  - name: String literal identifying this specific benchmark within the stage
  - function_type: Struct type with a static `impl()` method
  - For CPU: `run_benchmark<name, function_type>(args...)` where args are forwarded to `impl()`
  - For CUDA: `run_benchmark<name, function_type>(grid, block, shared_mem, bytes_processed, args...)` where:
    - grid: dim3 specifying grid dimensions
    - block: dim3 specifying block dimensions
    - shared_mem: uint64_t bytes of shared memory
    - bytes_processed: uint64_t bytes processed for throughput calculation
    - args...: Additional arguments forwarded to kernel `impl()`
  - Returns: `performance_metrics<benchmark_type>` object
- `print_results(show_comparison = true, show_metrics = true)`: Displays performance metrics and comparisons
  - show_comparison: Whether to show head-to-head comparisons between benchmarks
  - show_metrics: Whether to show detailed hardware counter metrics
- `get_results()`: Returns a sorted vector of all `performance_metrics` for programmatic access (see the sketch below)
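As a rough sketch of programmatic access, the snippet below retrieves the sorted results after some benchmarks have been run on a stage. Only `get_results()` itself is documented above; the member names hinted at in the commented-out line are assumptions for illustration, so consult `performance_metrics` in the headers for the real field names.

```cpp
// After run_benchmark calls on this stage (as in the Basic Example above):
using stage = bnch_swt::benchmark_stage<"int-to-string-comparison", 200, 25, bnch_swt::benchmark_types::cpu>;

auto results = stage::get_results(); // sorted vector of performance_metrics
std::cout << "Collected " << results.size() << " result sets\n";
for (const auto& metrics : results) {
    // Hypothetical accessors, shown only to illustrate per-benchmark inspection:
    // std::cout << metrics.name << ": " << metrics.throughput_mb_per_s << " MB/s\n";
}
```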
Benchmark Function Requirements
Benchmark functions must be defined as structs with a static impl() method:
For CPU benchmarks:
```cpp
struct my_cpu_benchmark {
    BNCH_SWT_HOST static uint64_t impl(/* your parameters */) {
        // Your CPU code to benchmark
        uint64_t bytes_processed = /* calculate bytes */;
        return bytes_processed; // Must return bytes processed
    }
};
```
For CUDA benchmarks:
```cpp
struct my_cuda_benchmark {
    BNCH_SWT_DEVICE static void impl(/* your parameters */) {
        // Your CUDA kernel code (runs on device)
        // This code will be wrapped in a kernel launch by the framework
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // ... your kernel logic
    }
};
```
Key differences:
- CPU: `impl()` returns `uint64_t` (bytes processed) and uses `BNCH_SWT_HOST`
- CUDA: `impl()` returns `void`, uses `BNCH_SWT_DEVICE`, and contains kernel code (not a kernel launch)
- CUDA: Bytes processed is passed as a parameter to `run_benchmark()`, not returned from `impl()`
- CUDA: The framework automatically wraps your `impl()` in a kernel launch with the specified grid/block dimensions
CPU vs GPU Benchmarking
As of v1.0.0, benchmarksuite supports both CPU and GPU (CUDA) benchmarking through the benchmark_types enum.
CPU Benchmarks
```cpp
// Define CPU benchmark function
struct cpu_computation_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<float>& input, std::vector<float>& output) {
        // Your CPU computation here
        for (size_t i = 0; i < input.size(); ++i) {
            output[i] = std::sqrt(input[i] * input[i] + 1.0f);
        }
        // Return bytes processed for throughput calculation
        return input.size() * sizeof(float);
    }
};

// Create CPU benchmark stage (200 total iterations, 25 measured, CPU type)
using cpu_stage = bnch_swt::benchmark_stage<"cpu-test", 200, 25, bnch_swt::benchmark_types::cpu>;

// Setup data
constexpr size_t data_size = 1024 * 1024;
std::vector<float> input(data_size, 1.0f);
std::vector<float> output(data_size);

// Run the benchmark
cpu_stage::run_benchmark<"my-cpu-function", cpu_computation_benchmark>(input, output);

// Print results
cpu_stage::print_results();
```
GPU/CUDA Benchmarks
```cpp
// Define CUDA kernel benchmark
struct cuda_kernel_benchmark {
    BNCH_SWT_DEVICE static void impl(float* data, uint64_t size) {
        // Your CUDA kernel code here
        // This runs inside the kernel, NOT as a kernel launch
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < size) {
            data[idx] = data[idx] * 2.0f; // Example operation
        }
    }
};

// Create CUDA benchmark stage
using cuda_stage = bnch_swt::benchmark_stage<"gpu-test", 100, 10, bnch_swt::benchmark_types::cuda>;

// Setup
constexpr uint64_t data_size = 1024 * 1024;
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));

// Configure kernel launch parameters
dim3 grid{256, 1, 1};
dim3 block{256, 1, 1};
uint64_t shared_memory = 0;
uint64_t bytes_processed = data_size * sizeof(float);

// Run CUDA benchmark
// Parameters: grid, block, shared_mem, bytes_processed, then your kernel args
cuda_stage::run_benchmark<"my-cuda-kernel", cuda_kernel_benchmark>(
    grid, block, shared_memory, bytes_processed, gpu_data, data_size);

cuda_stage::print_results();
cudaFree(gpu_data);
```
Important: For CUDA benchmarks, the impl() method contains the kernel code itself (not a kernel launch). The benchmarking framework wraps it in a kernel launch using the provided grid/block dimensions.
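Conceptually, that wrapping looks roughly like the sketch below. This is only an illustration of the idea described above, not benchmarksuite's actual internals; the `benchmark_kernel` template is a hypothetical name.

```cpp
// Conceptual sketch (NOT the library's real implementation) of how a device-side impl()
// ends up inside a kernel launch driven by the grid/block passed to run_benchmark.
template<typename benchmark_type, typename... arg_types>
__global__ void benchmark_kernel(arg_types... args) {
    // The framework's timing and counters surround this launch; impl() is plain device code.
    benchmark_type::impl(args...);
}

// ...which would be launched roughly like:
// benchmark_kernel<cuda_kernel_benchmark><<<grid, block, shared_memory>>>(gpu_data, data_size);
```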
Mixed CPU/GPU Benchmarking
You can benchmark CPU and GPU implementations side-by-side:
```cpp
constexpr uint64_t data_size = 1024 * 1024;

// CPU benchmark function
struct cpu_process_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<float>& cpu_data) {
        // Process data on CPU
        for (size_t i = 0; i < cpu_data.size(); ++i) {
            cpu_data[i] = cpu_data[i] * 2.0f;
        }
        return cpu_data.size() * sizeof(float);
    }
};

// GPU benchmark function (kernel code, NOT kernel launch)
struct gpu_process_benchmark {
    BNCH_SWT_DEVICE static void impl(float* gpu_data, uint64_t size) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < size) {
            gpu_data[idx] = gpu_data[idx] * 2.0f;
        }
    }
};

// Setup test data
std::vector<float> cpu_data(data_size);
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));

// CPU version
using cpu_test = bnch_swt::benchmark_stage<"cpu-vs-gpu", 100, 10, bnch_swt::benchmark_types::cpu>;
cpu_test::run_benchmark<"cpu-version", cpu_process_benchmark>(cpu_data);

// GPU version
using gpu_test = bnch_swt::benchmark_stage<"cpu-vs-gpu", 100, 10, bnch_swt::benchmark_types::cuda>;
dim3 grid{(data_size + 255) / 256, 1, 1};
dim3 block{256, 1, 1};
gpu_test::run_benchmark<"gpu-version", gpu_process_benchmark>(
    grid, block, 0, data_size * sizeof(float),
    gpu_data, data_size);

// Print both results for comparison
cpu_test::print_results();
gpu_test::print_results();

cudaFree(gpu_data);
```
This allows direct performance comparison between CPU and GPU implementations of the same algorithm.
Cache Clearing Option
For more accurate CPU benchmarks, you can enable cache clearing between iterations:
```cpp
// Enable cache clearing (5th template parameter)
using cache_cleared = bnch_swt::benchmark_stage<"cache-test", 200, 25, bnch_swt::benchmark_types::cpu, true>;
```
This is useful when benchmarking memory-bound operations where you want to measure cold cache performance.
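For example, a cold-cache walk over a large buffer could be run through the `cache_cleared` stage defined above. The `strided_sum_benchmark` struct, the buffer size, and the stride are illustrative choices, not part of the library.

```cpp
// Illustrative memory-bound benchmark run with cache clearing enabled.
struct strided_sum_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<uint64_t>& data, uint64_t& sink) {
        uint64_t sum = 0;
        // Stride of 8 uint64_t values (64 bytes) touches a new cache line per iteration.
        for (size_t i = 0; i < data.size(); i += 8) {
            sum += data[i];
        }
        sink = sum; // keep the result observable
        return data.size() * sizeof(uint64_t);
    }
};

std::vector<uint64_t> data(8 * 1024 * 1024, 1); // 64 MiB, far larger than typical caches
uint64_t sink = 0;
cache_cleared::run_benchmark<"cold-cache-strided-sum", strided_sum_benchmark>(data, sink);
cache_cleared::print_results();
```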
Custom Metrics
You can specify custom metric names for specialized benchmarks that don't measure traditional throughput:
```cpp
// Compression benchmark with custom metric name
using compression_bench = bnch_swt::benchmark_stage<"compression-test", 200, 25,
    bnch_swt::benchmark_types::cpu, false, "compression-ratio">;

struct compress_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<uint8_t>& input) {
        auto compressed = compress_data(input);
        // Return custom metric value (e.g., compression ratio * 1000)
        return (input.size() * 1000) / compressed.size();
    }
};

compression_bench::run_benchmark<"my-compressor", compress_benchmark>(input_data);
compression_bench::print_results();
```
When a custom metric name is provided, the results will display your custom metric instead of standard MB/s throughput.
Running Benchmarks
With vcpkg + CMake (recommended):
```bash
# Configure with vcpkg toolchain
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake -DCMAKE_BUILD_TYPE=Release

# Build
cmake --build build --config Release

# Run
./build/your_benchmark              # Linux/macOS
.\build\Release\your_benchmark.exe  # Windows
```
Manual CMake build:
```bash
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
./build/your_benchmark
```
For CUDA benchmarks, ensure CUDA is enabled:
```bash
cmake -B build -S . \
    -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES=86   # Adjust for your GPU architecture
cmake --build build --config Release
```
Common CMake Options
- `-DCMAKE_BUILD_TYPE=Release` - Build optimized release version
- `-DCMAKE_CUDA_ARCHITECTURES=86` - Target specific CUDA compute capability (e.g., 86 for RTX 30xx, 89 for RTX 40xx)
- `-DCMAKE_CXX_COMPILER=clang++` - Specify C++ compiler
- `-DCMAKE_CUDA_COMPILER=nvcc` - Specify CUDA compiler
Complete Project Example
Project structure:
my-benchmark/
├── CMakeLists.txt
├── vcpkg.json
├── main.cpp
└── benchmarks/
├── cpu_benchmark.hpp
└── gpu_benchmark.cuh
CMakeLists.txt:
```cmake
cmake_minimum_required(VERSION 3.20)
project(MyBenchmark LANGUAGES CXX CUDA)

# C++23 required
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# CUDA C++20 for GPU benchmarks
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

# Find benchmarksuite
find_package(benchmarksuite CONFIG REQUIRED)

# Create executable
add_executable(my_benchmark
    main.cpp
    benchmarks/cpu_benchmark.hpp
    benchmarks/gpu_benchmark.cuh
)

# Link benchmarksuite
target_link_libraries(my_benchmark PRIVATE
    benchmarksuite::benchmarksuite
)

# CUDA properties
set_target_properties(my_benchmark PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
)

# Optimization flags
if(MSVC)
    target_compile_options(my_benchmark PRIVATE /O2 /arch:AVX2)
else()
    target_compile_options(my_benchmark PRIVATE -O3 -march=native)
endif()
```
vcpkg.json:
{ "name": "my-benchmark", "version": "1.0.0", "dependencies": [ "benchmarksuite" ] }
Output and Results
Calling `print_results(true, true)` on a stage prints per-benchmark metrics followed by a head-to-head comparison, for example:

```
Performance Metrics for: int-to-string-comparisons-1
Metrics for: benchmarksuite::internal::to_chars
Total Iterations to Stabilize : 394
Measured Iterations : 20
Bytes Processed : 512.00
Nanoseconds per Execution : 5785.25
Frequency (GHz) : 4.83
Throughput (MB/s) : 84.58
Throughput Percentage Deviation (+/-%) : 8.36
Cycles per Execution : 27921.20
Cycles per Byte : 54.53
Instructions per Execution : 52026.00
Instructions per Cycle : 1.86
Instructions per Byte : 101.61
Branches per Execution : 361.45
Branch Misses per Execution : 0.73
Cache References per Execution : 97.03
Cache Misses per Execution : 74.68
----------------------------------------
Metrics for: glz::to_chars
Total Iterations to Stabilize : 421
Measured Iterations : 20
Bytes Processed : 512.00
Nanoseconds per Execution : 6480.30
Frequency (GHz) : 4.68
Throughput (MB/s) : 75.95
Throughput Percentage Deviation (+/-%) : 17.58
Cycles per Execution : 30314.40
Cycles per Byte : 59.21
Instructions per Execution : 51513.00
Instructions per Cycle : 1.70
Instructions per Byte : 100.61
Branches per Execution : 438.25
Branch Misses per Execution : 0.73
Cache References per Execution : 95.93
Cache Misses per Execution : 73.59
----------------------------------------
Library benchmarksuite::internal::to_chars, is faster than library: glz::to_chars, by roughly: 11.36%.
```
This structured output makes it easy to see at a glance which implementation is faster or more efficient. The head-to-head comparison is derived from the measured throughput: 84.58 MB/s ÷ 75.95 MB/s ≈ 1.1136, i.e., roughly 11.36% faster.
Features
Dual Benchmarking Support
- CPU Benchmarking: Traditional CPU performance measurement with hardware counters
- GPU/CUDA Benchmarking: Native CUDA kernel benchmarking with grid/block configuration
- Mixed Workloads: Compare CPU vs GPU implementations side-by-side
- Automatic Device Selection: Choose benchmark type via `bnch_swt::benchmark_types::cpu` or `bnch_swt::benchmark_types::cuda`
Advanced Options
- Cache Clearing: Optional cache eviction between iterations for cold-cache benchmarks
- Custom Metrics: Define custom metric names for specialized benchmarks (e.g., compression ratios, custom throughput units)
- Configurable Iterations: Separate control over warmup iterations and measured iterations
- Programmatic Access: Retrieve raw performance metrics via `get_results()` for custom analysis
Hardware Introspection
- CPU Properties: Comprehensive CPU detection and properties via `benchmarksuite_cpu_properties.hpp`
- GPU Properties: CUDA device detection and properties via `benchmarksuite_gpu_properties.hpp`
Performance Counters
- Cross-platform CPU counters: Windows, Linux, macOS, Android, Apple ARM
- CUDA performance events: GPU-specific performance monitoring via `counters/cuda_perf_events.hpp`
Utilities
- Cache management: Cross-platform cache clearing utilities
- Aligned constants: Compile-time aligned data structures
- Random generators: High-quality random data generation for benchmarks
API Conventions
As of v1.0.0, all APIs follow snake_case naming convention:
- Functions: `do_not_optimize_away()`, `generate_random_integers()`, `print_results()`
- Types: `size_type`, `string_literal`
- Variables: `bytes_processed`, `test_values`
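As a quick illustration of these utilities in use, the sketch below keeps a computed value alive with `do_not_optimize_away()`. The exact signature and namespace of `do_not_optimize_away()` are assumptions inferred from its naming in this guide, so check the headers before relying on them.

```cpp
// Illustrative use of the snake_case utilities; do_not_optimize_away() is assumed here
// to live in the bnch_swt namespace and to accept the value to preserve.
struct sum_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<int64_t>& values) {
        int64_t sum = 0;
        for (int64_t v : values) {
            sum += v;
        }
        bnch_swt::do_not_optimize_away(sum); // keep the result from being optimized out
        return values.size() * sizeof(int64_t);
    }
};
```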
Migrating from Pre-1.0.0
If you're upgrading from an earlier version:
1. Update package name: the package is now published as `benchmarksuite`
2. Update include paths: All includes are lowercase (already standard)
3. Update API calls: Convert camelCase/PascalCase to snake_case
   - `doNotOptimizeAway()` → `do_not_optimize_away()`
   - `printResults()` → `print_results()`
   - `generateRandomIntegers()` → `generate_random_integers()`
4. Change benchmark interface: Lambdas are replaced with structs

```cpp
// Old (lambda-based)
benchmark_stage<"test">::run_benchmark<"name">([&] {
    // code here
    return bytes_processed;
});

// New (struct-based)
struct my_benchmark {
    BNCH_SWT_HOST static uint64_t impl(/* params */) {
        // code here
        return bytes_processed;
    }
};
benchmark_stage<"test">::run_benchmark<"name", my_benchmark>(/* args */);
```

5. Update template parameters: benchmark_stage now has more options
```cpp
// Old (positional parameters)
benchmark_stage<"test", iterations, measured>

// New (with defaults and additional options)
benchmark_stage<"test", 200, 25, benchmark_types::cpu, false, "">
//              name    max  meas. type                cache  metric
```
6. New feature - Device types: You can now specify CPU or CUDA benchmarking:

```cpp
// CPU (default)
benchmark_stage<"test", 200, 25, bnch_swt::benchmark_types::cpu>

// CUDA/GPU
benchmark_stage<"test", 100, 10, bnch_swt::benchmark_types::cuda>
```
7. New feature - Cache clearing: Enable cache clearing between iterations for CPU benchmarks:

```cpp
// Clear cache between each iteration (5th parameter)
benchmark_stage<"test", 200, 25, benchmark_types::cpu, true>
```
8. New feature - Custom metrics: Specify custom metric names for specialized benchmarks:

```cpp
// Use custom metric instead of default throughput (6th parameter)
benchmark_stage<"compression-test", 200, 25, benchmark_types::cpu, false, "compression-ratio">
```
Now you're ready to start benchmarking with benchmarksuite!