Throughout our Defense and Commercial Consulting engagements, a common theme continually turns up in design, code reviews, and watercooler talk: performance.  Waxing theoretical about what should be fast is fun, but not usually very accurate.  To truly understand performance trades, you must measure.  Celero is a microbenchmark library designed as a tool to understand algorithmic trades within very small pieces of code. (C++, specifically, in this case.)

Introduction to Celero

Developing consistent and meaningful benchmark results for software is a complex task. Measurement tools exist (Intel® VTune™ Amplifier, SmartBear AQTime, Valgrind, etc.) external to applications, but they are sometimes expensive for small teams or cumbersome to utilize. This project, Celero, aims to be a small library which can be added to a C++ project and perform benchmarks on code in a way which is easy to reproduce, share, and compare among individual runs, developers, or projects. Celero uses a framework similar to that of GoogleTest to make its API easier to use and integrate into a project. Make automated benchmarking as much a part of your development process as automated testing.

Celero uses CMake to provide cross-platform builds. It does require a modern compiler (Visual C++ 2012+, GCC 4.7+, Clang 2.9+) due to its use of C++11.

Once Celero is added to your project, you can create dedicated benchmark projects and source files. For convenience, there is a single header file and a CELERO_MAIN macro that can be used to provide a main() for your benchmark project that will automatically execute all of your benchmark tests.

Key Features

  • Supports Windows, Linux, and OSX using C++11.
  • The timing utilities can be used directly in production code (independent of benchmarks).
  • Console table output is formatted as Markdown to easily copy/paste into documents.
  • Archive results to track performance over time.
  • Integrates into CI/CT/CD environments with JUnit-formatted output.
  • User-defined Experiment Values can scale test results, sample sizes, and user-defined properties for each run.
  • User-defined Measurements allow for measuring anything in addition to timing.
  • Supports Test Fixtures.
  • Supports fixed-time benchmark baselines.
  • Capture a rich set of timing statistics to a file.
  • Easily installed using CMake, Conan, or VCPkg.

Command Line

<celeroOutputExecutable> [-g groupNameToRun] [-t resultsTable.csv] [-j junitOutputFile.xml] [-a resultArchive.csv] [-d numberOfIterationsPerDistribution] [-h]
  • -g Use this option to run only one benchmark group out of all benchmarks contained within a test executable.
  • -t Writes all results to a CSV file. Very useful when using problem sets to graph performance.
  • -j Writes JUnit formatted XML output. To utilize JUnit output, benchmarks must use the _TEST version of the macros and specify an expected baseline multiple. When the test exceeds this multiple, the JUnit output will indicate a failure.
  • -a Builds or updates an archive of historical results, tracking current, best, and worst results for each benchmark.
  • -d (Experimental) builds a plot of four different sample sizes to investigate the distribution of sample results.

Celero Basics

Background

The goal, generally, of writing benchmarks is to measure the performance of a piece of code. Benchmarks are useful for comparing multiple solutions to the same problem to select the most appropriate one. Other times, benchmarks can highlight the performance impact of design or algorithm changes and quantify them in a meaningful way.

By measuring code performance, you eliminate errors in your assumptions about what the “right” solution is for performance. Only through measurement can you confirm that using a lookup table, for example, is faster than computing a value. Such lore (which is often repeated) can lead to bad design decisions and, ultimately, slower code.

The goal of writing good benchmarking code is to eliminate all of the noise and overhead to measure just the code under test. Sources of noise in the measurements include clock resolution noise, operating system background operations, test setup/teardown, framework overhead, and other unrelated system activity.

At a theoretical level, we want to measure “t”, the time to execute the code under test. In reality, we measure “t” plus all of this measurement noise.

These extraneous contributors to our measurement of “t” fluctuate over time. Therefore, we want to try to isolate “t”. The way this is accomplished is by making many measurements, but only keeping the smallest total. The smallest total is necessarily the one with the smallest noise contribution and closest to the actual time “t”.
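To make the idea concrete, here is a minimal sketch using only the standard library. It is not Celero's actual implementation; the function name smallestSampleMicroseconds and its parameters are hypothetical and exist only for illustration. It times a batch of calls several times and keeps the smallest total.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <limits>

// Hypothetical helper (not part of Celero): runs `operations` calls of `code`
// per sample, repeats that `samples` times, and returns the smallest total
// in whole microseconds.
std::uint64_t smallestSampleMicroseconds(const std::function<void()>& code, int samples, int operations)
{
    assert(samples > 0 && operations > 0);

    using Clock = std::chrono::steady_clock;
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();

    for(int s = 0; s < samples; ++s)
    {
        const auto start = Clock::now();

        for(int i = 0; i < operations; ++i)
        {
            code();
        }

        const auto stop = Clock::now();
        const auto total = static_cast<std::uint64_t>(
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count());

        // Keep only the smallest total: it carries the least measurement noise.
        if(total < best)
        {
            best = total;
        }
    }

    return best;
}
```

Calling smallestSampleMicroseconds(work, 10, 1000000) with some callable work would correspond to 10 samples of 1000000 operations each.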

Once this measurement is obtained, it has little meaning in isolation. It is important to create a baseline test by which to compare. A baseline should generally be a “classic” or “pure” solution to the problem on which you are measuring a solution. Once you have a baseline, you have a meaningful time to compare your algorithm against. Simply saying that your fancy sorting algorithm (fSort) sorted a million elements in 10 milliseconds is not sufficient by itself. However, compare that to a classic sorting algorithm baseline such as quicksort (qSort) and then you can say that fSort is 50% faster than qSort on a million elements. That is a meaningful and powerful measurement.

Implementation

Celero heavily utilizes C++11 features that are available in both Visual C++ 2012 and GCC 4.7. This greatly aided in making the code clean and portable. To make adopting the code easier, all definitions needed by a user are defined in a celero namespace within a single include file: Celero.h

Celero.h has within it the macro definitions that turn each of the user benchmark cases into its own unique class with the associated test fixture (if any) and then registers the test case within a Factory. The macros automatically associate baseline test cases with their associated test benchmarks so that, at run time, benchmark-relative numbers can be computed. This association is maintained by TestVector.

The TestVector utilizes the PImpl idiom to help hide implementation and keep the #include overhead of Celero.h to a minimum.

Celero reports its outputs to the command line. Since colors are nice (and perhaps contribute to the human factors/readability of the results), something beyond std::cout was called for. Console.h defines a simple color function, SetConsoleColor, which is utilized by the functions in the celero::print namespace to nicely format the program’s output.

Measuring benchmark execution time takes place in the TestFixture base class, from which all benchmarks are ultimately derived. First, the test fixture setup code is executed. Then, the start time for the test is retrieved and stored in microseconds using an unsigned long. This is done to reduce floating point error. Next, the specified number of operations (iterations) is executed. When complete, the end time is retrieved, the test fixture is torn down, and the measured time for the execution is returned and the results are saved.

This cycle is repeated for however many samples were specified. If no samples were specified (zero), then the test is repeated until it has run for at least one second or at least 30 samples have been taken. While writing this specific part of the code, there was a definite “if-else” relationship. However, the bulk of the code was repeated within the “if” and “else” sections. An old-fashioned function could have been used here, but it was very natural to utilize std::function to define a lambda that could be called and keep all of the code clean. (C++11 is a fantastic thing.) Finally, the results are printed to the screen.
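The sampling policy above can be sketched roughly as follows. This is an illustration under assumptions, not Celero's source: the name collectSamples and the parameters timeBudget and sampleCap are hypothetical, and the sketch reads the rule as "stop as soon as either the time budget is spent or the sample cap is reached". It also shows the std::function lambda shared by both branches of the “if-else”.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sketch (not Celero's code). `requestedSamples == 0` means
// "choose the number of samples automatically".
std::vector<std::uint64_t> collectSamples(const std::function<void()>& code,
                                          std::size_t requestedSamples,
                                          std::chrono::microseconds timeBudget = std::chrono::seconds(1),
                                          std::size_t sampleCap = 30)
{
    using Clock = std::chrono::steady_clock;
    std::vector<std::uint64_t> results;

    // One lambda holds the measurement code shared by both branches below.
    const std::function<void()> takeSample = [&]() {
        const auto start = Clock::now();
        code();
        const auto stop = Clock::now();
        results.push_back(static_cast<std::uint64_t>(
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()));
    };

    if(requestedSamples > 0)
    {
        // A fixed number of samples was requested.
        for(std::size_t i = 0; i < requestedSamples; ++i)
        {
            takeSample();
        }
    }
    else
    {
        // Automatic mode: stop when the budget is spent or the cap is reached.
        const auto begin = Clock::now();
        while((Clock::now() - begin) < timeBudget && results.size() < sampleCap)
        {
            takeSample();
        }
    }

    return results;
}
```

Keeping the shared measurement code in one lambda means the two branches differ only in their loop condition.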

General Program Flow

To summarize, this pseudo-code illustrates how the tests are executed internally:

for(Each Experiment)
{
  for(Each Sample)
  {
    // Call the virtual function
    // and DO NOT include its time in the measurement.
    experiment->setUp();

    // Start the Timer
    timer->start();

    // Run all iterations
    for(Each Iteration)
    {
      // Call the virtual function
      // and include its time in the measurement.
      experiment->onExperimentStart(x);

      // Run the code under test
      experiment->run(threads, iterations, experimentValue);
    
      // Call the virtual function
      // and include its time in the measurement.
      experiment->onExperimentEnd();
    }

    // Stop the Timer
    timer->stop();

    // Record data...

    // Call the virtual teardown function
    // and DO NOT include its time in the measurement.
    experiment->tearDown();
  }
}
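For readers who prefer something compilable, the pseudo-code can be approximated with the sketch below. The Experiment interface here is a hypothetical stand-in, not Celero's real TestFixture, and run() takes no arguments for brevity. LoggingExperiment simply records the order in which it is called, so the sequence can be checked.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a benchmark fixture; not Celero's real class.
struct Experiment
{
    virtual ~Experiment() = default;
    virtual void setUp() {}
    virtual void onExperimentStart() {}
    virtual void run() = 0;
    virtual void onExperimentEnd() {}
    virtual void tearDown() {}
};

// Runs `samples` timed samples of `iterations` iterations each and returns
// one elapsed time (in microseconds) per sample.
std::vector<std::uint64_t> runExperiment(Experiment& experiment, int samples, int iterations)
{
    assert(samples >= 0 && iterations >= 0);

    using Clock = std::chrono::steady_clock;
    std::vector<std::uint64_t> elapsed;

    for(int s = 0; s < samples; ++s)
    {
        experiment.setUp(); // not included in the measurement

        const auto start = Clock::now();
        for(int i = 0; i < iterations; ++i)
        {
            experiment.onExperimentStart(); // included in the measurement
            experiment.run();               // the code under test
            experiment.onExperimentEnd();   // included in the measurement
        }
        const auto stop = Clock::now();

        elapsed.push_back(static_cast<std::uint64_t>(
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()));

        experiment.tearDown(); // not included in the measurement
    }

    return elapsed;
}

// Example experiment that records the order of the calls it receives.
struct LoggingExperiment : Experiment
{
    std::string log;
    void setUp() override { log += 'S'; }
    void onExperimentStart() override { log += '<'; }
    void run() override { log += 'r'; }
    void onExperimentEnd() override { log += '>'; }
    void tearDown() override { log += 'T'; }
};
```

Running a LoggingExperiment for 2 samples of 3 iterations leaves "S<r><r><r>TS<r><r><r>T" in its log: setup and teardown wrap each sample, while the start and end hooks wrap every iteration inside the timed region.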

Using the Code

Celero uses CMake to provide cross-platform builds. It does require a modern compiler (Visual C++ 2012 or GCC 4.7+) due to its use of C++11.

Once Celero is added to your project, you can create dedicated benchmark projects and source files. For convenience, there is a single header file and a CELERO_MAIN macro that can be used to provide a main() for your benchmark project that will automatically execute all of your benchmark tests.

Example Code

Here is an example of a simple Celero Benchmark. (Note: This is a complete, runnable example.)

#include <celero/Celero.h>

#include <random>

// sin() and fmod() are declared in <cmath>, so include it on every platform.
#include <cmath>
#include <cstdlib>

///
/// This is the main(int argc, char** argv) for the entire celero program.
/// You can write your own, or use this macro to insert the standard one into the project.
///
CELERO_MAIN

std::random_device RandomDevice;
std::uniform_int_distribution<int> UniformDistribution(0, 1024);

///
/// In reality, all of the "Complex" cases take the same amount of time to run.
/// The difference in the results is a product of measurement error.
///
/// Interestingly, taking the sin of a constant number here resulted in a 
/// great deal of optimization in clang and gcc.
///
BASELINE(DemoSimple, Baseline, 10, 1000000)
{
    celero::DoNotOptimizeAway(static_cast<float>(sin(UniformDistribution(RandomDevice))));
}

///
/// Run a test consisting of 1 sample of 710000 operations per measurement.
/// There are not enough samples here to likely get a meaningful result.
///
BENCHMARK(DemoSimple, Complex1, 1, 710000)
{
    celero::DoNotOptimizeAway(static_cast<float>(sin(fmod(UniformDistribution(RandomDevice), 3.14159265))));
}

///
/// Run a test consisting of 30 samples of 710000 operations per measurement.
/// There are not enough samples here to get a reasonable measurement.
/// It should get a Baseline number lower than the previous test.
///
BENCHMARK(DemoSimple, Complex2, 30, 710000)
{
    celero::DoNotOptimizeAway(static_cast<float>(sin(fmod(UniformDistribution(RandomDevice), 3.14159265))));
}

///
/// Run a test consisting of 60 samples of 710000 operations per measurement.
/// There are not enough samples here to get a reasonable measurement.
/// It should get a Baseline number lower than the previous test.
///
BENCHMARK(DemoSimple, Complex3, 60, 710000)
{
    celero::DoNotOptimizeAway(static_cast<float>(sin(fmod(UniformDistribution(RandomDevice), 3.14159265))));
}

The first thing we do in this code is to define a BASELINE test case. This macro takes four arguments:

BASELINE(GroupName, BaselineName, Samples, Operations)
  • GroupName – The name of the benchmark group. This is used to batch together runs and results with their corresponding baseline measurement.
  • BaselineName – The name of this baseline for reporting purposes.
  • Samples – The total number of times you want to execute the given number of operations on the test code.
  • Operations – The total number of times you want to execute the test code per sample.

Samples and operations together make it possible to measure very fast code. If a single execution of the code in your benchmark is too fast to time reliably on its own (well under 100 milliseconds, for example), the operations number tells Celero to execute the code that many times before taking a measurement. Samples define how many such measurements to make.

Celero helps with this by allowing you to specify zero samples. Zero samples will tell Celero to make some statistically significant number of samples based on how long it takes to complete your specified number of operations. These numbers will be reported at run time.
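The arithmetic behind batching operations is simple, and the small sketch below spells it out. The helper name microsecondsPerOperation is hypothetical and not part of Celero: one sample's total time is divided by the number of operations in that sample to estimate the cost of a single execution.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Celero): estimates the time of a single
// execution from one sample that ran the code `operations` times.
double microsecondsPerOperation(std::uint64_t sampleMicroseconds, std::uint64_t operations)
{
    assert(operations > 0);
    return static_cast<double>(sampleMicroseconds) / static_cast<double>(operations);
}
```

For instance, with made-up numbers, a sample of 1000000 operations that takes 500000 microseconds in total works out to 0.5 microseconds per operation, an interval that would be hard to measure reliably by timing a single call.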

The celero::DoNotOptimizeAway template is provided to ensure that the optimizing compiler does not eliminate your function or code. Since this feature is used in all of the sample benchmarks and their baseline, its time overhead is canceled out in the comparisons.
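To see why such a helper is needed, consider one common way to get a similar effect: storing the result in a volatile object. The sketch below is an assumption-laden illustration; keepResult and computeAndKeep are hypothetical names, and celero::DoNotOptimizeAway itself may well be implemented differently.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper; celero::DoNotOptimizeAway may work differently.
// A write to a volatile object is an observable side effect, so the compiler
// cannot discard the value being stored. Intended for simple scalar types.
template <typename T>
void keepResult(const T& value)
{
    volatile T sink = value;
    (void)sink;
}

// Example: without keepResult, an optimizer would be free to drop the
// otherwise unused sin() calls. Returns the number of results kept.
int computeAndKeep(int calls)
{
    assert(calls >= 0);

    int kept = 0;
    for(int i = 0; i < calls; ++i)
    {
        keepResult(static_cast<float>(std::sin(static_cast<double>(i))));
        ++kept;
    }

    return kept;
}
```

Note that a sink like this only guarantees the value is produced. The compiler may still evaluate an expression with constant inputs at compile time, which is why the example benchmark feeds sin() a random number.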

After the baseline is defined, various benchmarks are then defined. The syntax for the BENCHMARK macro is identical to that of the BASELINE macro.

Results

Running Celero’s simple example experiment (celeroDemoSimple.exe) benchmark gave the following output on a PC:

[Figure: Celero “DemoSimple” output]

The first test that executes will be the group’s baseline. Celero took 10 samples of 1000000 iterations of the code in our test. (Each set of 1000000 iterations was measured, and this was done 10 times and the smallest time was taken.) The “Baseline” value for the baseline measurement itself will always be 1.0.

After the baseline is complete, each individual test runs. Each test is executed and measured in the same way; however, there is an additional metric reported: Baseline. This compares the time it takes to compute the benchmark to the baseline. The data here shows that CeleroBenchTest.Complex1 takes 1.007949 times as long to execute as the baseline.
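The Baseline column is just a ratio, which the following sketch makes explicit. The helper name baselineMultiple is hypothetical and this is not Celero's code; it assumes both times were measured over the same number of operations.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Celero): how many times as long the
// benchmark took compared with its baseline. Both times are assumed to
// cover the same number of operations.
double baselineMultiple(std::uint64_t benchmarkMicroseconds, std::uint64_t baselineMicroseconds)
{
    assert(baselineMicroseconds > 0);
    return static_cast<double>(benchmarkMicroseconds) / static_cast<double>(baselineMicroseconds);
}
```

With made-up numbers, a benchmark that takes 1500 microseconds against a baseline of 1000 microseconds has a multiple of 1.5, and a baseline compared with itself is always 1.0. Values above 1.0 mean the benchmark is slower than the baseline; values below 1.0 mean it is faster.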

Finally, you need to ensure that the number of iterations and samples is producing stable output for your experiment cases. These numbers may be the same as your now-stable baseline case.

Get the Code

Celero is available on GitHub with an Apache 2.0 license.
