Benchmark Suite
Hello and welcome to bnch_swt, or "Benchmark Suite": a collection of classes and functions for benchmarking CPU and GPU performance.
The officially supported compilers and operating systems are listed in the Requirements section below.
Quickstart Guide for benchmarksuite
This guide will walk you through setting up and running benchmarks using benchmarksuite.
Table of Contents
- Installation
- Basic Example
- Creating Benchmarks
- CPU vs GPU Benchmarking
- Running Benchmarks
- Output and Results
- Features
- API Conventions
- Migrating from Pre-1.0.0
Installation
Method 1: vcpkg + CMake (Recommended)
Step 1: Add to vcpkg.json
Create or update your vcpkg.json in your project root:
{ "name": "your-project-name", "version": "1.0.0", "dependencies": [ "benchmarksuite" ] }
Step 2: Configure CMake
In your CMakeLists.txt:
```cmake
cmake_minimum_required(VERSION 3.20)
project(YourProject LANGUAGES CXX CUDA) # Add CUDA if using GPU benchmarks

# Set C++ standard
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# For CUDA support (optional)
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

# Find the package
find_package(benchmarksuite CONFIG REQUIRED)

# Create your executable
add_executable(your_benchmark main.cpp)

# Link against benchmarksuite (header-only, just sets up includes)
target_link_libraries(your_benchmark PRIVATE benchmarksuite::benchmarksuite)

# If using CUDA
set_target_properties(your_benchmark PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
```
Step 3: Configure with vcpkg toolchain
```bash
# Configure
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake

# Build
cmake --build build --config Release
```
Step 4: Include in your code
```cpp
#include <bnch_swt/index.hpp>

int main() {
    // Your benchmarks here
    return 0;
}
```
Method 2: Manual Installation
If not using vcpkg, you can include benchmarksuite as a header-only library:
Step 1: Clone the repository
```bash
git clone https://github.com/RealTimeChris/benchmarksuite.git
```
Step 2: Add to CMake
```cmake
# Add as subdirectory
add_subdirectory(path/to/benchmarksuite)

# Or set include directory
target_include_directories(your_target PRIVATE path/to/benchmarksuite/include)
```
Step 3: Include headers
```cpp
#include <bnch_swt/index.hpp>
```
Requirements
To use benchmarksuite, ensure you have a C++23 (or later) compliant compiler.
For CPU Benchmarking:
- MSVC 2022 or later
- GCC 13 or later
- Clang 16 or later
For GPU/CUDA Benchmarking:
- NVIDIA CUDA Toolkit 11.0 or later
- NVCC compiler
- CUDA-capable GPU
Platform-Specific Notes
Windows:
- Use Visual Studio 2022 or later
- For CUDA: Install CUDA Toolkit from NVIDIA
Linux:
- Install build essentials: `sudo apt-get install build-essential`
- For CUDA: Install CUDA Toolkit via package manager or NVIDIA installer
macOS:
- Install Xcode Command Line Tools
- CUDA support not available on Apple Silicon (M1/M2/M3)
Verification
Verify your installation with a simple test:
```cpp
#include <bnch_swt/index.hpp>
#include <iostream>

int main() {
    std::cout << "benchmarksuite successfully installed!" << std::endl;
    return 0;
}
```
Basic Example
The following example demonstrates how to set up and run a benchmark comparing two integer-to-string conversion functions:
```cpp
// Define benchmark functions as structs with static impl() methods
struct glz_to_chars_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values, std::vector<std::string>& test_values_00,
        std::vector<std::string>& test_values_01) {
        uint64_t bytes_processed = 0;
        char newer_string[30]{};
        for (uint64_t x = 0; x < test_values.size(); ++x) {
            std::memset(newer_string, '\0', sizeof(newer_string));
            auto new_ptr = glz::to_chars(newer_string, test_values[x]);
            bytes_processed += test_values_00[x].size();
            test_values_01[x] = std::string{ newer_string, static_cast<uint64_t>(new_ptr - newer_string) };
        }
        return bytes_processed;
    }
};

struct jsonifier_to_chars_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values, std::vector<std::string>& test_values_00,
        std::vector<std::string>& test_values_01) {
        uint64_t bytes_processed = 0;
        char newer_string[30]{};
        for (uint64_t x = 0; x < test_values.size(); ++x) {
            std::memset(newer_string, '\0', sizeof(newer_string));
            auto new_ptr = jsonifier_internal::to_chars(newer_string, test_values[x]);
            bytes_processed += test_values_00[x].size();
            test_values_01[x] = std::string{ newer_string, static_cast<uint64_t>(new_ptr - newer_string) };
        }
        return bytes_processed;
    }
};

int main() {
    constexpr uint64_t count = 512;

    // Setup test data
    std::vector<int64_t> test_values = generate_random_integers<int64_t>(count, 20);
    std::vector<std::string> test_values_00;
    std::vector<std::string> test_values_01(count);
    for (uint64_t x = 0; x < count; ++x) {
        test_values_00.emplace_back(std::to_string(test_values[x]));
    }

    // Define benchmark stage with 200 total iterations, 25 measured, CPU benchmarking
    using benchmark = bnch_swt::benchmark_stage<"int-to-string-comparison", 200, 25,
        bnch_swt::benchmark_types::cpu>;

    // Run benchmarks
    benchmark::run_benchmark<"glz::to_chars", glz_to_chars_benchmark>(test_values, test_values_00, test_values_01);
    benchmark::run_benchmark<"jsonifier::to_chars", jsonifier_to_chars_benchmark>(test_values, test_values_00, test_values_01);

    // Print results with comparison
    benchmark::print_results(true, true);
    return 0;
}
```
Creating Benchmarks
To create a benchmark:
- Define your benchmark functions as structs with a static `impl()` method that returns `uint64_t` (bytes processed)
- Use `bnch_swt::benchmark_stage` with appropriate template parameters
- Call `run_benchmark` with your benchmark struct and any required arguments (see the minimal sketch below)
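The minimal sketch below ties these three steps together. It uses only the API documented in this guide; the `copy_benchmark` struct, its buffers, and the stage name are illustrative placeholders rather than part of the library.

```cpp
#include <bnch_swt/index.hpp>
#include <cstring>
#include <vector>

// Step 1: a struct with a static impl() that returns bytes processed (illustrative example)
struct copy_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<char>& src, std::vector<char>& dst) {
        std::memcpy(dst.data(), src.data(), src.size());
        return src.size(); // bytes processed
    }
};

int main() {
    std::vector<char> src(1024 * 1024, 'a');
    std::vector<char> dst(src.size());

    // Step 2: a benchmark stage (defaults: 200 total iterations, 25 measured, CPU)
    using stage = bnch_swt::benchmark_stage<"copy-test">;

    // Step 3: run and report
    stage::run_benchmark<"memcpy", copy_benchmark>(src, dst);
    stage::print_results();
    return 0;
}
```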
Benchmark Stage
The benchmark_stage structure orchestrates each test and supports both CPU and GPU benchmarking:
```cpp
// Full template signature
template<bnch_swt::string_literal stage_name,                                        // Required: benchmark stage name
    uint64_t max_execution_count                = 200,                               // Total iterations (warmup + measured)
    uint64_t measured_iteration_count           = 25,                                // Iterations to measure
    bnch_swt::benchmark_types benchmark_type    = bnch_swt::benchmark_types::cpu,    // CPU or CUDA
    bool clear_cpu_cache_between_each_iteration = false,                             // Cache clearing flag
    bnch_swt::string_literal metric_name        = bnch_swt::string_literal<1>{}>     // Custom metric name
struct benchmark_stage;

// Common usage examples
using cpu_benchmark = bnch_swt::benchmark_stage<"my-benchmark">; // Uses defaults: 200 total, 25 measured, CPU
using gpu_benchmark = bnch_swt::benchmark_stage<"gpu-test", 100, 10, bnch_swt::benchmark_types::cuda>;
using custom_metric = bnch_swt::benchmark_stage<"compression", 200, 25, bnch_swt::benchmark_types::cpu, false, "compression-ratio">;
```
Template Parameters
- stage_name (required): String literal identifying the benchmark stage
- max_execution_count (default 200): Total number of iterations including warmup
- measured_iteration_count (default 25): Number of iterations to measure for final metrics
- benchmark_type (default cpu): `bnch_swt::benchmark_types::cpu` or `bnch_swt::benchmark_types::cuda`
- clear_cpu_cache_between_each_iteration (default false): Whether to clear CPU caches between iterations
- metric_name (default empty): Custom metric name for specialized benchmarks (e.g., compression ratios)
Methods
- `run_benchmark<name, function_type>(args...)`: Executes the benchmark function's `impl()` method with the provided arguments
  - name: String literal identifying this specific benchmark within the stage
  - function_type: Struct type with a static `impl()` method
  - For CPU: `run_benchmark<name, function_type>(args...)` where args are forwarded to `impl()`
  - For CUDA: `run_benchmark<name, function_type>(grid, block, shared_mem, bytes_processed, args...)` where:
    - grid: dim3 specifying grid dimensions
    - block: dim3 specifying block dimensions
    - shared_mem: uint64_t bytes of shared memory
    - bytes_processed: uint64_t bytes processed for throughput calculation
    - args...: Additional arguments forwarded to kernel `impl()`
  - Returns: `performance_metrics<benchmark_type>` object
- `print_results(show_comparison = true, show_metrics = true)`: Displays performance metrics and comparisons
  - show_comparison: Whether to show head-to-head comparisons between benchmarks
  - show_metrics: Whether to show detailed hardware counter metrics
- `get_results()`: Returns a sorted vector of all `performance_metrics` for programmatic access (see the sketch below)
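As a rough sketch of programmatic access, the snippet below retrieves the sorted results after some benchmarks have been run on a stage. Only `get_results()` itself is documented above; the member names hinted at in the commented-out line are assumptions for illustration, so consult `performance_metrics` in the headers for the real field names.

```cpp
// After run_benchmark calls on this stage (as in the Basic Example above):
using stage = bnch_swt::benchmark_stage<"int-to-string-comparison", 200, 25, bnch_swt::benchmark_types::cpu>;

auto results = stage::get_results(); // sorted vector of performance_metrics
std::cout << "Collected " << results.size() << " result sets\n";
for (const auto& metrics : results) {
    // Hypothetical accessors, shown only to illustrate per-benchmark inspection:
    // std::cout << metrics.name << ": " << metrics.throughput_mb_per_s << " MB/s\n";
}
```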
Benchmark Function Requirements
Benchmark functions must be defined as structs with a static impl() method:
For CPU benchmarks:
```cpp
struct my_cpu_benchmark {
    BNCH_SWT_HOST static uint64_t impl(/* your parameters */) {
        // Your CPU code to benchmark
        uint64_t bytes_processed = /* calculate bytes */;
        return bytes_processed; // Must return bytes processed
    }
};
```
For CUDA benchmarks:
```cpp
struct my_cuda_benchmark {
    BNCH_SWT_DEVICE static void impl(/* your parameters */) {
        // Your CUDA kernel code (runs on device)
        // This code will be wrapped in a kernel launch by the framework
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // ... your kernel logic
    }
};
```
Key differences:
- CPU: `impl()` returns `uint64_t` (bytes processed) and uses `BNCH_SWT_HOST`
- CUDA: `impl()` returns `void`, uses `BNCH_SWT_DEVICE`, and contains kernel code (not a kernel launch)
- CUDA: Bytes processed is passed as a parameter to `run_benchmark()`, not returned from `impl()`
- CUDA: The framework automatically wraps your `impl()` in a kernel launch with the specified grid/block dimensions
CPU vs GPU Benchmarking
As of v1.0.0, benchmarksuite supports both CPU and GPU (CUDA) benchmarking through the benchmark_types enum.
CPU Benchmarks
```cpp
// Define CPU benchmark function
struct cpu_computation_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<float>& input, std::vector<float>& output) {
        // Your CPU computation here
        for (size_t i = 0; i < input.size(); ++i) {
            output[i] = std::sqrt(input[i] * input[i] + 1.0f);
        }
        // Return bytes processed for throughput calculation
        return input.size() * sizeof(float);
    }
};

// Create CPU benchmark stage (200 total iterations, 25 measured, CPU type)
using cpu_stage = bnch_swt::benchmark_stage<"cpu-test", 200, 25, bnch_swt::benchmark_types::cpu>;

// Setup data
constexpr size_t data_size = 1024 * 1024;
std::vector<float> input(data_size, 1.0f);
std::vector<float> output(data_size);

// Run the benchmark
cpu_stage::run_benchmark<"my-cpu-function", cpu_computation_benchmark>(input, output);

// Print results
cpu_stage::print_results();
```
GPU/CUDA Benchmarks
```cpp
// Define CUDA kernel benchmark
struct cuda_kernel_benchmark {
    BNCH_SWT_DEVICE static void impl(float* data, uint64_t size) {
        // Your CUDA kernel code here
        // This runs inside the kernel, NOT as a kernel launch
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < size) {
            data[idx] = data[idx] * 2.0f; // Example operation
        }
    }
};

// Create CUDA benchmark stage
using cuda_stage = bnch_swt::benchmark_stage<"gpu-test", 100, 10, bnch_swt::benchmark_types::cuda>;

// Setup
constexpr uint64_t data_size = 1024 * 1024;
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));

// Configure kernel launch parameters
dim3 grid{256, 1, 1};
dim3 block{256, 1, 1};
uint64_t shared_memory = 0;
uint64_t bytes_processed = data_size * sizeof(float);

// Run CUDA benchmark
// Parameters: grid, block, shared_mem, bytes_processed, then your kernel args
cuda_stage::run_benchmark<"my-cuda-kernel", cuda_kernel_benchmark>(
    grid, block, shared_memory, bytes_processed, gpu_data, data_size);

cuda_stage::print_results();
cudaFree(gpu_data);
```
Important: For CUDA benchmarks, the impl() method contains the kernel code itself (not a kernel launch). The benchmarking framework wraps it in a kernel launch using the provided grid/block dimensions.
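Conceptually, that wrapping looks roughly like the sketch below. This is only an illustration of the idea described above, not benchmarksuite's actual internals; the `benchmark_kernel` template is a hypothetical name.

```cpp
// Conceptual sketch (NOT the library's real implementation) of how a device-side impl()
// ends up inside a kernel launch driven by the grid/block passed to run_benchmark.
template<typename benchmark_type, typename... arg_types>
__global__ void benchmark_kernel(arg_types... args) {
    // The framework's timing and counters surround this launch; impl() is plain device code.
    benchmark_type::impl(args...);
}

// ...which would be launched roughly like:
// benchmark_kernel<cuda_kernel_benchmark><<<grid, block, shared_memory>>>(gpu_data, data_size);
```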
Mixed CPU/GPU Benchmarking
You can benchmark CPU and GPU implementations side-by-side:
```cpp
constexpr uint64_t data_size = 1024 * 1024;

// CPU benchmark function
struct cpu_process_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<float>& cpu_data) {
        // Process data on CPU
        for (size_t i = 0; i < cpu_data.size(); ++i) {
            cpu_data[i] = cpu_data[i] * 2.0f;
        }
        return cpu_data.size() * sizeof(float);
    }
};

// GPU benchmark function (kernel code, NOT kernel launch)
struct gpu_process_benchmark {
    BNCH_SWT_DEVICE static void impl(float* gpu_data, uint64_t size) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < size) {
            gpu_data[idx] = gpu_data[idx] * 2.0f;
        }
    }
};

// Setup test data
std::vector<float> cpu_data(data_size);
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));

// CPU version
using cpu_test = bnch_swt::benchmark_stage<"cpu-vs-gpu", 100, 10, bnch_swt::benchmark_types::cpu>;
cpu_test::run_benchmark<"cpu-version", cpu_process_benchmark>(cpu_data);

// GPU version
using gpu_test = bnch_swt::benchmark_stage<"cpu-vs-gpu", 100, 10, bnch_swt::benchmark_types::cuda>;
dim3 grid{(data_size + 255) / 256, 1, 1};
dim3 block{256, 1, 1};
gpu_test::run_benchmark<"gpu-version", gpu_process_benchmark>(
    grid, block, 0, data_size * sizeof(float),
    gpu_data, data_size);

// Print both results for comparison
cpu_test::print_results();
gpu_test::print_results();

cudaFree(gpu_data);
```
This allows direct performance comparison between CPU and GPU implementations of the same algorithm.
Cache Clearing Option
For more accurate CPU benchmarks, you can enable cache clearing between iterations:
```cpp
// Enable cache clearing (5th template parameter)
using cache_cleared = bnch_swt::benchmark_stage<"cache-test", 200, 25, bnch_swt::benchmark_types::cpu, true>;
```
This is useful when benchmarking memory-bound operations where you want to measure cold cache performance.
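For example, a cold-cache walk over a large buffer could be run through the `cache_cleared` stage defined above. The `strided_sum_benchmark` struct, the buffer size, and the stride are illustrative choices, not part of the library.

```cpp
// Illustrative memory-bound benchmark run with cache clearing enabled.
struct strided_sum_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<uint64_t>& data, uint64_t& sink) {
        uint64_t sum = 0;
        // Stride of 8 uint64_t values (64 bytes) touches a new cache line per iteration.
        for (size_t i = 0; i < data.size(); i += 8) {
            sum += data[i];
        }
        sink = sum; // keep the result observable
        return data.size() * sizeof(uint64_t);
    }
};

std::vector<uint64_t> data(8 * 1024 * 1024, 1); // 64 MiB, far larger than typical caches
uint64_t sink = 0;
cache_cleared::run_benchmark<"cold-cache-strided-sum", strided_sum_benchmark>(data, sink);
cache_cleared::print_results();
```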
Custom Metrics
You can specify custom metric names for specialized benchmarks that don't measure traditional throughput:
```cpp
// Compression benchmark with custom metric name
using compression_bench = bnch_swt::benchmark_stage<"compression-test", 200, 25,
    bnch_swt::benchmark_types::cpu, false, "compression-ratio">;

struct compress_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<uint8_t>& input) {
        auto compressed = compress_data(input);
        // Return custom metric value (e.g., compression ratio * 1000)
        return (input.size() * 1000) / compressed.size();
    }
};

compression_bench::run_benchmark<"my-compressor", compress_benchmark>(input_data);
compression_bench::print_results();
```
When a custom metric name is provided, the results will display your custom metric instead of standard MB/s throughput.
Running Benchmarks
With vcpkg + CMake (recommended):
```bash
# Configure with vcpkg toolchain
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake -DCMAKE_BUILD_TYPE=Release

# Build
cmake --build build --config Release

# Run
./build/your_benchmark              # Linux/macOS
.\build\Release\your_benchmark.exe  # Windows
```
Manual CMake build:
```bash
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
./build/your_benchmark
```
For CUDA benchmarks, ensure CUDA is enabled:
```bash
cmake -B build -S . \
    -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES=86   # Adjust for your GPU architecture
cmake --build build --config Release
```
Common CMake Options
- `-DCMAKE_BUILD_TYPE=Release` - Build optimized release version
- `-DCMAKE_CUDA_ARCHITECTURES=86` - Target specific CUDA compute capability (e.g., 86 for RTX 30xx, 89 for RTX 40xx)
- `-DCMAKE_CXX_COMPILER=clang++` - Specify C++ compiler
- `-DCMAKE_CUDA_COMPILER=nvcc` - Specify CUDA compiler
Complete Project Example
Project structure:
my-benchmark/
├── CMakeLists.txt
├── vcpkg.json
├── main.cpp
└── benchmarks/
├── cpu_benchmark.hpp
└── gpu_benchmark.cuh
CMakeLists.txt:
```cmake
cmake_minimum_required(VERSION 3.20)
project(MyBenchmark LANGUAGES CXX CUDA)

# C++23 required
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# CUDA C++20 for GPU benchmarks
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

# Find benchmarksuite
find_package(benchmarksuite CONFIG REQUIRED)

# Create executable
add_executable(my_benchmark
    main.cpp
    benchmarks/cpu_benchmark.hpp
    benchmarks/gpu_benchmark.cuh
)

# Link benchmarksuite
target_link_libraries(my_benchmark PRIVATE
    benchmarksuite::benchmarksuite
)

# CUDA properties
set_target_properties(my_benchmark PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
)

# Optimization flags
if(MSVC)
    target_compile_options(my_benchmark PRIVATE /O2 /arch:AVX2)
else()
    target_compile_options(my_benchmark PRIVATE -O3 -march=native)
endif()
```
vcpkg.json:
{ "name": "my-benchmark", "version": "1.0.0", "dependencies": [ "benchmarksuite" ] }
Output and Results
Calling `print_results(true, true)` on a stage prints per-benchmark metrics followed by a head-to-head comparison, for example:

```
Performance Metrics for: int-to-string-comparisons-1
Metrics for: benchmarksuite::internal::to_chars
Total Iterations to Stabilize : 394
Measured Iterations : 20
Bytes Processed : 512.00
Nanoseconds per Execution : 5785.25
Frequency (GHz) : 4.83
Throughput (MB/s) : 84.58
Throughput Percentage Deviation (+/-%) : 8.36
Cycles per Execution : 27921.20
Cycles per Byte : 54.53
Instructions per Execution : 52026.00
Instructions per Cycle : 1.86
Instructions per Byte : 101.61
Branches per Execution : 361.45
Branch Misses per Execution : 0.73
Cache References per Execution : 97.03
Cache Misses per Execution : 74.68
----------------------------------------
Metrics for: glz::to_chars
Total Iterations to Stabilize : 421
Measured Iterations : 20
Bytes Processed : 512.00
Nanoseconds per Execution : 6480.30
Frequency (GHz) : 4.68
Throughput (MB/s) : 75.95
Throughput Percentage Deviation (+/-%) : 17.58
Cycles per Execution : 30314.40
Cycles per Byte : 59.21
Instructions per Execution : 51513.00
Instructions per Cycle : 1.70
Instructions per Byte : 100.61
Branches per Execution : 438.25
Branch Misses per Execution : 0.73
Cache References per Execution : 95.93
Cache Misses per Execution : 73.59
----------------------------------------
Library benchmarksuite::internal::to_chars, is faster than library: glz::to_chars, by roughly: 11.36%.
```
This structured output makes it easy to see at a glance which implementation is faster or more efficient. The head-to-head comparison is derived from the measured throughput: 84.58 MB/s ÷ 75.95 MB/s ≈ 1.1136, i.e., roughly 11.36% faster.
Features
Dual Benchmarking Support
- CPU Benchmarking: Traditional CPU performance measurement with hardware counters
- GPU/CUDA Benchmarking: Native CUDA kernel benchmarking with grid/block configuration
- Mixed Workloads: Compare CPU vs GPU implementations side-by-side
- Automatic Device Selection: Choose benchmark type via `bnch_swt::benchmark_types::cpu` or `bnch_swt::benchmark_types::cuda`
Advanced Options
- Cache Clearing: Optional cache eviction between iterations for cold-cache benchmarks
- Custom Metrics: Define custom metric names for specialized benchmarks (e.g., compression ratios, custom throughput units)
- Configurable Iterations: Separate control over warmup iterations and measured iterations
- Programmatic Access: Retrieve raw performance metrics via `get_results()` for custom analysis
Hardware Introspection
- CPU Properties: Comprehensive CPU detection and properties via `benchmarksuite_cpu_properties.hpp`
- GPU Properties: CUDA device detection and properties via `benchmarksuite_gpu_properties.hpp`
Performance Counters
- Cross-platform CPU counters: Windows, Linux, macOS, Android, Apple ARM
- CUDA performance events: GPU-specific performance monitoring via `counters/cuda_perf_events.hpp`
Utilities
- Cache management: Cross-platform cache clearing utilities
- Aligned constants: Compile-time aligned data structures
- Random generators: High-quality random data generation for benchmarks
API Conventions
As of v1.0.0, all APIs follow snake_case naming convention:
- Functions: `do_not_optimize_away()`, `generate_random_integers()`, `print_results()`
- Types: `size_type`, `string_literal`
- Variables: `bytes_processed`, `test_values`
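As a quick illustration of these utilities in use, the sketch below keeps a computed value alive with `do_not_optimize_away()`. The exact signature and namespace of `do_not_optimize_away()` are assumptions inferred from its naming in this guide, so check the headers before relying on them.

```cpp
// Illustrative use of the snake_case utilities; do_not_optimize_away() is assumed here
// to live in the bnch_swt namespace and to accept the value to preserve.
struct sum_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<int64_t>& values) {
        int64_t sum = 0;
        for (int64_t v : values) {
            sum += v;
        }
        bnch_swt::do_not_optimize_away(sum); // keep the result from being optimized out
        return values.size() * sizeof(int64_t);
    }
};
```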
Migrating from Pre-1.0.0
If you're upgrading from an earlier version:
1. Update package name: the package is now published as `benchmarksuite`
2. Update include paths: All includes are lowercase (already standard)
3. Update API calls: Convert camelCase/PascalCase to snake_case
   - `doNotOptimizeAway()` → `do_not_optimize_away()`
   - `printResults()` → `print_results()`
   - `generateRandomIntegers()` → `generate_random_integers()`
4. Change benchmark interface: Lambdas are replaced with structs

```cpp
// Old (lambda-based)
benchmark_stage<"test">::run_benchmark<"name">([&] {
    // code here
    return bytes_processed;
});

// New (struct-based)
struct my_benchmark {
    BNCH_SWT_HOST static uint64_t impl(/* params */) {
        // code here
        return bytes_processed;
    }
};
benchmark_stage<"test">::run_benchmark<"name", my_benchmark>(/* args */);
```

5. Update template parameters: benchmark_stage now has more options
```cpp
// Old (positional parameters)
benchmark_stage<"test", iterations, measured>

// New (with defaults and additional options)
benchmark_stage<"test", 200, 25, benchmark_types::cpu, false, "">
//              name    max  meas. type                cache  metric
```
6. New feature - Device types: You can now specify CPU or CUDA benchmarking:

```cpp
// CPU (default)
benchmark_stage<"test", 200, 25, bnch_swt::benchmark_types::cpu>

// CUDA/GPU
benchmark_stage<"test", 100, 10, bnch_swt::benchmark_types::cuda>
```
7. New feature - Cache clearing: Enable cache clearing between iterations for CPU benchmarks:

```cpp
// Clear cache between each iteration (5th parameter)
benchmark_stage<"test", 200, 25, benchmark_types::cpu, true>
```
8. New feature - Custom metrics: Specify custom metric names for specialized benchmarks:

```cpp
// Use custom metric instead of default throughput (6th parameter)
benchmark_stage<"compression-test", 200, 25, benchmark_types::cpu, false, "compression-ratio">
```
Now you're ready to start benchmarking with benchmarksuite!