GitHub - RealTimeChris/benchmarksuite: A suite of benchmarks.

Benchmark Suite

Welcome to bnch_swt, or "Benchmark Suite": a collection of classes and functions for benchmarking CPU and GPU performance.

The following operating systems and compilers are officially supported:

Compiler Support


MSVC GCC CLANG NVCC

Operating System Support


Windows Linux Mac

Quickstart Guide for benchmarksuite

This guide will walk you through setting up and running benchmarks using benchmarksuite.

Installation

Method 1: vcpkg + CMake (recommended)

Step 1: Add to vcpkg.json

Create or update your vcpkg.json in your project root:

{
  "name": "your-project-name",
  "version": "1.0.0",
  "dependencies": [
    "benchmarksuite"
  ]
}

Step 2: Configure CMake

In your CMakeLists.txt:

cmake_minimum_required(VERSION 3.20)
project(YourProject LANGUAGES CXX CUDA)  # Add CUDA if using GPU benchmarks

# Set C++ standard
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# For CUDA support (optional)
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

# Find the package
find_package(benchmarksuite CONFIG REQUIRED)

# Create your executable
add_executable(your_benchmark main.cpp)

# Link against benchmarksuite (header-only, just sets up includes)
target_link_libraries(your_benchmark PRIVATE benchmarksuite::benchmarksuite)

# If using CUDA
set_target_properties(your_benchmark PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

Step 3: Configure with vcpkg toolchain

# Configure
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake

# Build
cmake --build build --config Release

Step 4: Include in your code

#include <bnch_swt/index.hpp>

int main() {
    // Your benchmarks here
    return 0;
}

Method 2: Manual Installation

If not using vcpkg, you can include benchmarksuite as a header-only library:

Step 1: Clone the repository

git clone https://github.com/RealTimeChris/benchmarksuite.git

Step 2: Add to CMake

# Add as subdirectory
add_subdirectory(path/to/benchmarksuite)

# Or set include directory
target_include_directories(your_target PRIVATE path/to/benchmarksuite/include)

Step 3: Include headers

#include <bnch_swt/index.hpp>

Requirements

To use benchmarksuite, ensure you have a C++23 (or later) compliant compiler.

For CPU Benchmarking:

For GPU/CUDA Benchmarking:

Platform-Specific Notes

Windows:

Linux:

macOS:

Verification

Verify your installation with a simple test:

#include <bnch_swt/index.hpp>
#include <iostream>

int main() {
    std::cout << "benchmarksuite successfully installed!" << std::endl;
    return 0;
}

Basic Example

The following example demonstrates how to set up and run a benchmark comparing two integer-to-string conversion functions:

#include <bnch_swt/index.hpp>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>
// Also include the Glaze and Jsonifier headers that provide
// glz::to_chars and jsonifier_internal::to_chars.

// Define benchmark functions as structs with static impl() methods
struct glz_to_chars_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values,
                                       std::vector<std::string>& test_values_00,
                                       std::vector<std::string>& test_values_01) {
        uint64_t bytes_processed = 0;
        char newer_string[30]{};
        for (uint64_t x = 0; x < test_values.size(); ++x) {
            std::memset(newer_string, '\0', sizeof(newer_string));
            auto new_ptr = glz::to_chars(newer_string, test_values[x]);
            bytes_processed += test_values_00[x].size();
            test_values_01[x] = std::string{newer_string, static_cast<uint64_t>(new_ptr - newer_string)};
        }
        return bytes_processed;
    }
};

struct jsonifier_to_chars_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values,
                                       std::vector<std::string>& test_values_00,
                                       std::vector<std::string>& test_values_01) {
        uint64_t bytes_processed = 0;
        char newer_string[30]{};
        for (uint64_t x = 0; x < test_values.size(); ++x) {
            std::memset(newer_string, '\0', sizeof(newer_string));
            auto new_ptr = jsonifier_internal::to_chars(newer_string, test_values[x]);
            bytes_processed += test_values_00[x].size();
            test_values_01[x] = std::string{newer_string, static_cast<uint64_t>(new_ptr - newer_string)};
        }
        return bytes_processed;
    }
};

int main() {
    constexpr uint64_t count = 512;

    // Setup test data
    std::vector<int64_t> test_values = generate_random_integers<int64_t>(count, 20);
    std::vector<std::string> test_values_00;
    std::vector<std::string> test_values_01(count);

    for (uint64_t x = 0; x < count; ++x) {
        test_values_00.emplace_back(std::to_string(test_values[x]));
    }

    // Define benchmark stage with 200 total iterations, 25 measured, CPU benchmarking
    using benchmark = bnch_swt::benchmark_stage<"int-to-string-comparison", 200, 25,
                                                 bnch_swt::benchmark_types::cpu>;

    // Run benchmarks
    benchmark::run_benchmark<"glz::to_chars", glz_to_chars_benchmark>(test_values, test_values_00, test_values_01);
    benchmark::run_benchmark<"jsonifier::to_chars", jsonifier_to_chars_benchmark>(test_values, test_values_00, test_values_01);

    // Print results with comparison
    benchmark::print_results(true, true);

    return 0;
}

Creating Benchmarks

To create a benchmark:

  1. Define your benchmark functions as structs with a static impl() method that returns uint64_t (bytes processed)
  2. Use bnch_swt::benchmark_stage with appropriate template parameters
  3. Call run_benchmark with your benchmark struct and any required arguments

Benchmark Stage

The benchmark_stage structure orchestrates each test and supports both CPU and GPU benchmarking:

// Full template signature
template<bnch_swt::string_literal stage_name,                                   // Required: benchmark stage name
         uint64_t max_execution_count = 200,                                    // Total iterations (warmup + measured)
         uint64_t measured_iteration_count = 25,                                // Iterations to measure
         bnch_swt::benchmark_types benchmark_type = bnch_swt::benchmark_types::cpu,  // CPU or CUDA
         bool clear_cpu_cache_between_each_iteration = false,                   // Cache clearing flag
         bnch_swt::string_literal metric_name = bnch_swt::string_literal<1>{}>  // Custom metric name
struct benchmark_stage;

// Common usage examples
using cpu_benchmark = bnch_swt::benchmark_stage<"my-benchmark">;  // Uses defaults: 200 total, 25 measured, CPU
using gpu_benchmark = bnch_swt::benchmark_stage<"gpu-test", 100, 10, bnch_swt::benchmark_types::cuda>;
using custom_metric = bnch_swt::benchmark_stage<"compression", 200, 25, bnch_swt::benchmark_types::cpu, false, "compression-ratio">;

Template Parameters

Methods

Benchmark Function Requirements

Benchmark functions must be defined as structs with a static impl() method:

For CPU benchmarks:

struct my_cpu_benchmark {
    BNCH_SWT_HOST static uint64_t impl(/* your parameters */) {
        // Your CPU code to benchmark
        uint64_t bytes_processed = 0;  // Replace 0 with the number of bytes your code processed
        return bytes_processed;        // Must return bytes processed
    }
};

For CUDA benchmarks:

struct my_cuda_benchmark {
    BNCH_SWT_DEVICE static void impl(/* your parameters */) {
        // Your CUDA kernel code (runs on device)
        // This code will be wrapped in a kernel launch by the framework
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // ... your kernel logic
    }
};

Key differences:

CPU vs GPU Benchmarking

As of v1.0.0, benchmarksuite supports both CPU and GPU (CUDA) benchmarking through the benchmark_types enum.

CPU Benchmarks

// Define CPU benchmark function
struct cpu_computation_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<float>& input, std::vector<float>& output) {
        // Your CPU computation here
        for (size_t i = 0; i < input.size(); ++i) {
            output[i] = std::sqrt(input[i] * input[i] + 1.0f);
        }
        // Return bytes processed for throughput calculation
        return input.size() * sizeof(float);
    }
};

// Create CPU benchmark stage (200 total iterations, 25 measured, CPU type)
using cpu_stage = bnch_swt::benchmark_stage<"cpu-test", 200, 25, bnch_swt::benchmark_types::cpu>;

// Setup data
constexpr size_t data_size = 1024 * 1024;
std::vector<float> input(data_size, 1.0f);
std::vector<float> output(data_size);

// Run the benchmark
cpu_stage::run_benchmark<"my-cpu-function", cpu_computation_benchmark>(input, output);

// Print results
cpu_stage::print_results();

GPU/CUDA Benchmarks

// Define CUDA kernel benchmark
struct cuda_kernel_benchmark {
    BNCH_SWT_DEVICE static void impl(float* data, uint64_t size) {
        // Your CUDA kernel code here
        // This runs inside the kernel, NOT as a kernel launch
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < size) {
            data[idx] = data[idx] * 2.0f;  // Example operation
        }
    }
};

// Create CUDA benchmark stage
using cuda_stage = bnch_swt::benchmark_stage<"gpu-test", 100, 10, bnch_swt::benchmark_types::cuda>;

// Setup
constexpr uint64_t data_size = 1024 * 1024;
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));

// Configure kernel launch parameters
dim3 grid{(data_size + 255) / 256, 1, 1};  // Enough 256-thread blocks to cover every element
dim3 block{256, 1, 1};
uint64_t shared_memory = 0;
uint64_t bytes_processed = data_size * sizeof(float);

// Run CUDA benchmark
// Parameters: grid, block, shared_mem, bytes_processed, then your kernel args
cuda_stage::run_benchmark<"my-cuda-kernel", cuda_kernel_benchmark>(
    grid, block, shared_memory, bytes_processed,
    gpu_data, data_size
);

cuda_stage::print_results();
cudaFree(gpu_data);

Important: For CUDA benchmarks, the impl() method contains the kernel code itself (not a kernel launch). The benchmarking framework wraps it in a kernel launch using the provided grid/block dimensions.

Mixed CPU/GPU Benchmarking

You can benchmark CPU and GPU implementations side-by-side:

constexpr uint64_t data_size = 1024 * 1024;

// CPU benchmark function
struct cpu_process_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<float>& cpu_data) {
        // Process data on CPU
        for (size_t i = 0; i < cpu_data.size(); ++i) {
            cpu_data[i] = cpu_data[i] * 2.0f;
        }
        return cpu_data.size() * sizeof(float);
    }
};

// GPU benchmark function (kernel code, NOT kernel launch)
struct gpu_process_benchmark {
    BNCH_SWT_DEVICE static void impl(float* gpu_data, uint64_t size) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < size) {
            gpu_data[idx] = gpu_data[idx] * 2.0f;
        }
    }
};

// Setup test data
std::vector<float> cpu_data(data_size);
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));

// CPU version
using cpu_test = bnch_swt::benchmark_stage<"cpu-vs-gpu", 100, 10, bnch_swt::benchmark_types::cpu>;
cpu_test::run_benchmark<"cpu-version", cpu_process_benchmark>(cpu_data);

// GPU version
using gpu_test = bnch_swt::benchmark_stage<"cpu-vs-gpu", 100, 10, bnch_swt::benchmark_types::cuda>;
dim3 grid{(data_size + 255) / 256, 1, 1};
dim3 block{256, 1, 1};
gpu_test::run_benchmark<"gpu-version", gpu_process_benchmark>(
    grid, block, 0, data_size * sizeof(float),
    gpu_data, data_size
);

// Print both results for comparison
cpu_test::print_results();
gpu_test::print_results();

cudaFree(gpu_data);

This allows direct performance comparison between CPU and GPU implementations of the same algorithm.

Cache Clearing Option

For more accurate CPU benchmarks, you can enable cache clearing between iterations:

// Enable cache clearing (5th template parameter)
using cache_cleared = bnch_swt::benchmark_stage<"cache-test", 200, 25, bnch_swt::benchmark_types::cpu, true>;

This is useful when benchmarking memory-bound operations where you want to measure cold cache performance.

Custom Metrics

You can specify custom metric names for specialized benchmarks that don't measure traditional throughput:

// Compression benchmark with custom metric name
using compression_bench = bnch_swt::benchmark_stage<"compression-test", 200, 25, bnch_swt::benchmark_types::cpu, false, "compression-ratio">;

struct compress_benchmark {
    // The uint8_t element type is an assumption; use whatever type compress_data() accepts
    BNCH_SWT_HOST static uint64_t impl(const std::vector<uint8_t>& input) {
        auto compressed = compress_data(input);
        // Return custom metric value (e.g., compression ratio * 1000)
        return (input.size() * 1000) / compressed.size();
    }
};

compression_bench::run_benchmark<"my-compressor", compress_benchmark>(input_data);
compression_bench::print_results();

When a custom metric name is provided, the results will display your custom metric instead of standard MB/s throughput.

Running Benchmarks

With vcpkg + CMake (recommended):

# Configure with vcpkg toolchain
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake -DCMAKE_BUILD_TYPE=Release

# Build
cmake --build build --config Release

# Run
./build/your_benchmark                  # Linux/macOS
.\build\Release\your_benchmark.exe      # Windows

Manual CMake build:

cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
./build/your_benchmark

For CUDA benchmarks, ensure CUDA is enabled:

cmake -B build -S . \
  -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=86  # Adjust for your GPU architecture

cmake --build build --config Release

Common CMake Options

Complete Project Example

Project structure:

my-benchmark/
├── CMakeLists.txt
├── vcpkg.json
├── main.cpp
└── benchmarks/
    ├── cpu_benchmark.hpp
    └── gpu_benchmark.cuh

CMakeLists.txt:

cmake_minimum_required(VERSION 3.20)
project(MyBenchmark LANGUAGES CXX CUDA)

# C++23 required
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# CUDA 20 for GPU benchmarks
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

# Find benchmarksuite
find_package(benchmarksuite CONFIG REQUIRED)

# Create executable
add_executable(my_benchmark
    main.cpp
    benchmarks/cpu_benchmark.hpp
    benchmarks/gpu_benchmark.cuh
)

# Link benchmarksuite
target_link_libraries(my_benchmark PRIVATE
    benchmarksuite::benchmarksuite
)

# CUDA properties
set_target_properties(my_benchmark PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
)

# Optimization flags
if(MSVC)
    target_compile_options(my_benchmark PRIVATE /O2 /arch:AVX2)
else()
    target_compile_options(my_benchmark PRIVATE -O3 -march=native)
endif()

vcpkg.json:

{
  "name": "my-benchmark",
  "version": "1.0.0",
  "dependencies": [
    "benchmarksuite"
  ]
}

Output and Results

Performance Metrics for: int-to-string-comparisons-1
Metrics for: benchmarksuite::internal::to_chars
Total Iterations to Stabilize                               : 394
Measured Iterations                                         : 20
Bytes Processed                                             : 512.00
Nanoseconds per Execution                                   : 5785.25
Frequency (GHz)                                             : 4.83
Throughput (MB/s)                                           : 84.58
Throughput Percentage Deviation (+/-%)                      : 8.36
Cycles per Execution                                        : 27921.20
Cycles per Byte                                             : 54.53
Instructions per Execution                                  : 52026.00
Instructions per Cycle                                      : 1.86
Instructions per Byte                                       : 101.61
Branches per Execution                                      : 361.45
Branch Misses per Execution                                 : 0.73
Cache References per Execution                              : 97.03
Cache Misses per Execution                                  : 74.68
----------------------------------------
Metrics for: glz::to_chars
Total Iterations to Stabilize                               : 421
Measured Iterations                                         : 20
Bytes Processed                                             : 512.00
Nanoseconds per Execution                                   : 6480.30
Frequency (GHz)                                             : 4.68
Throughput (MB/s)                                           : 75.95
Throughput Percentage Deviation (+/-%)                      : 17.58
Cycles per Execution                                        : 30314.40
Cycles per Byte                                             : 59.21
Instructions per Execution                                  : 51513.00
Instructions per Cycle                                      : 1.70
Instructions per Byte                                       : 100.61
Branches per Execution                                      : 438.25
Branch Misses per Execution                                 : 0.73
Cache References per Execution                              : 95.93
Cache Misses per Execution                                  : 73.59
----------------------------------------
Library benchmarksuite::internal::to_chars, is faster than library: glz::to_chars, by roughly: 11.36%.

This structured output helps you quickly identify which implementation is faster or more efficient.

Features

Dual Benchmarking Support

Advanced Options

Hardware Introspection

Performance Counters

Utilities

API Conventions

As of v1.0.0, all APIs follow the snake_case naming convention, for example print_results() and generate_random_integers().

Migrating from Pre-1.0.0

If you're upgrading from an earlier version:

  1. Update package name: the package is now named benchmarksuite
  2. Update include paths: All includes are lowercase (already standard)
  3. Update API calls: Convert camelCase/PascalCase to snake_case
    • doNotOptimizeAway() → do_not_optimize_away()
    • printResults() → print_results()
    • generateRandomIntegers() → generate_random_integers()
  4. Change benchmark interface: Lambdas are replaced with structs
    // Old (lambda-based)
    benchmark_stage<"test">::run_benchmark<"name">([&] {
        // code here
        return bytes_processed;
    });
    // New (struct-based)
    struct my_benchmark {
        BNCH_SWT_HOST static uint64_t impl(/* params */) {
            // code here
            return bytes_processed;
        }
    };
    benchmark_stage<"test">::run_benchmark<"name", my_benchmark>(/* args */);
  5. Update template parameters: benchmark_stage now has more options
    // Old (positional parameters)
    benchmark_stage<"test", iterations, measured>
    // New (with defaults and additional options)
    benchmark_stage<"test", 200, 25, benchmark_types::cpu, false, "">
    //                      ^^^  ^^  ^^^^^^^^^^^^^^^^^^^^  ^^^^^  ^^
    //                      max, measured, type, cache, metric (in that order)
  6. New feature - Device types: You can now specify CPU or CUDA benchmarking:
    // CPU (default)
    benchmark_stage<"test", 200, 25, bnch_swt::benchmark_types::cpu>
    // CUDA/GPU
    benchmark_stage<"test", 100, 10, bnch_swt::benchmark_types::cuda>
  7. New feature - Cache clearing: Enable cache clearing between iterations for CPU benchmarks:
    // Clear cache between each iteration (5th parameter)
    benchmark_stage<"test", 200, 25, benchmark_types::cpu, true>
  8. New feature - Custom metrics: Specify custom metric names for specialized benchmarks:
    // Use custom metric instead of default throughput (6th parameter)
    benchmark_stage<"compression-test", 200, 25, benchmark_types::cpu, false, "compression-ratio">


Now you're ready to start benchmarking with benchmarksuite!