package multicore-bench
Sources
sha256=11dd1cef2e1ac845464e2155ac6d03ff4d08e2bd4890aa1b0e6c82197342fb70
sha512=393da8f48330fae8151a9468f8cf1e51c1c323b954c9eb96f1f5b16cc66a7dd6a1674e8eacac9f265d9d7a1929e152d0bcf257191f57a48e2b41c3fab3c8758e
Description
Published: 17 Feb 2024
README
Multicore-bench
Multicore bench is a framework for writing multicore benchmark executables to run on current-bench.
Benchmarking multicore algorithms tends to require a certain amount of setup: spawning domains, synchronizing them before work, timing the work, collecting the times, and joining the domains. This framework tries to take care of that setup for you as conveniently as possible. Furthermore, benchmarking multicore algorithms in OCaml involves a number of pitfalls related to how the OCaml runtime works. For example, when only a single domain is running, several operations provided by the OCaml runtime use specialized implementations that take advantage of that fact. In most cases, when benchmarking multicore algorithms, you don't actually want to measure those specialized runtime implementations.
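To give a sense of the setup involved, here is a minimal hand-rolled sketch using only the OCaml 5 standard library. It is not how Multicore bench is implemented; the name `run_manually` and the spin barrier are assumptions made for illustration, and `Sys.time` (processor time) is only a stand-in for the monotonic wall clock a real benchmark would use.

```ocaml
(* Hypothetical helper, NOT part of Multicore_bench: spawn [n_domains]
   domains, synchronize them, time [work] on each, and join them. *)
let run_manually ~n_domains ~work =
  (* Countdown barrier: every domain decrements it and then spins until all
     domains have arrived, so that they start working at about the same
     time. *)
  let barrier = Atomic.make n_domains in
  let domains =
    List.init n_domains (fun domain_index ->
        Domain.spawn (fun () ->
            Atomic.decr barrier;
            while Atomic.get barrier <> 0 do
              Domain.cpu_relax ()
            done;
            (* [Sys.time] reports processor time used by the program. It is
               used here only to keep the sketch free of dependencies. *)
            let start = Sys.time () in
            work domain_index;
            Sys.time () -. start))
  in
  (* Joining collects one elapsed time per domain. *)
  List.map Domain.join domains

let () =
  let counter = Atomic.make 0 in
  let times =
    run_manually ~n_domains:2 ~work:(fun _domain_index ->
        for _ = 1 to 100_000 do
          Atomic.incr counter
        done)
  in
  Printf.printf "%d domains performed %d increments\n" (List.length times)
    (Atomic.get counter)
```

Even this small version has to handle the barrier, the per-domain timing, and the join by hand, which is the kind of boilerplate the framework aims to remove.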
The design of multicore bench is considered experimental. We are planning to improve the design along with current-bench in the future to allow a more useful benchmarking experience.
Example: Benchmarking Atomic.incr under contention
Let's look at a simple example, with detailed comments, of how one might benchmark Atomic.incr under contention.
Note that this example is written here as an MDX document or test. Normally you would write a benchmark as a command line executable and would likely compile it in release mode with a native compiler. You can find examples of multicore benchmarks in the Saturn and Kcas libraries.
We first open the Multicore_bench module:
# open Multicore_bench
This brings into scope multiple modules, including Suite, Util, Times, and Cmd, that we use below.
Typically one would divide a benchmark executable into benchmark suites for different algorithms and data structures. To illustrate that pattern, let's create a module Bench_atomic for our benchmark suite on atomics:
# module Bench_atomic : sig
    (* The entrypoint to a suite is basically a function. There is a type
       alias for the signature. *)
    val run_suite : Suite.t
  end = struct
    (* [run_one] runs a single benchmark with the given budget and number of
       domains. *)
    let run_one ~budgetf ~n_domains () =
      (* We scale the number of operations using [Util.iter_factor], which
         depends on various factors such as whether we are running on a 32- or
         64-bit machine, native or bytecode, and whether we are running on
         multicore OCaml. *)
      let n = 10 * Util.iter_factor in
      (* In this example, [atomic] is the data structure we are benchmarking. *)
      let atomic =
        Atomic.make 0
        |> Multicore_magic.copy_as_padded
        (* We explicitly pad the [atomic] to avoid false sharing. With false
           sharing, measurements are likely to have a lot of noise that makes
           it difficult to get useful results. *)
      in
      (* We store the number of operations to perform in an atomic. The idea
         is that we want all the workers or domains to work at the same time
         as much as possible, because we want to measure performance under
         contention. So, instead of e.g. simply having each domain run a
         fixed count loop, which could lead to some domains finishing well
         before others, we let the number of operations performed by each
         domain vary. *)
      let n_ops_to_do =
        Atomic.make 0
        |> Multicore_magic.copy_as_padded
        (* We also explicitly pad the number of ops to avoid false sharing. *)
      in
      (* [init] is called on each domain before [work]. The return value of
         [init] is passed to [work]. *)
      let init _domain_index =
        (* It doesn't matter that we set the atomic multiple times. *)
        Atomic.set n_ops_to_do n
      in
      (* [work] is called on each domain and the time it takes is recorded.
         The second argument comes from [init]. *)
      let work _domain_index () =
        (* Because we are benchmarking operations that take a very small
           amount of time, we run our own loop to perform the operations.
           This has pros and cons. One con is that the loop overhead will be
           part of the measurement, which is something to keep in mind when
           interpreting the results. One pro is that this gives more
           flexibility in various ways. *)
        let rec work () =
          (* We try to allocate some number of operations to perform. *)
          let n = Util.alloc n_ops_to_do in
          (* If we got zero, then we should stop. *)
          if n <> 0 then begin
            (* Otherwise we perform the operations and try again. *)
            for _ = 1 to n do
              Atomic.incr atomic
            done;
            work ()
          end
        in
        work ()
      in
      (* [config] is a name for the configuration of the benchmark. In this
         case we distinguish by the number of workers or domains. *)
      let config =
        Printf.sprintf "%d worker%s" n_domains
          (if n_domains = 1 then "" else "s")
      in
      (* [Times.record] does the heavy lifting to spawn domains and measure
         the time [work] takes on them. *)
      let times = Times.record ~budgetf ~n_domains ~init ~work () in
      (* [Times.to_thruput_metrics] takes the measurements and produces both a
         metric for the time of a single operation and for the total thruput
         over all the domains. *)
      Times.to_thruput_metrics ~n ~singular:"incr" ~config times
    (* [run_suite] runs the benchmarks in this suite with the given budget. *)
    let run_suite ~budgetf =
      (* In this case we run the benchmark with various numbers of domains.
         We use [concat_map] to collect the results as a flat list of
         outputs. *)
      [ 1; 2; 4; 8 ]
      |> List.concat_map @@ fun n_domains ->
         run_one ~budgetf ~n_domains ()
  end
module Bench_atomic : sig val run_suite : Suite.t end
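The comments above describe domains repeatedly claiming work from a shared atomic counter until none is left. The following is a minimal sketch of that idea using only the standard library. The function `alloc_batch` and its `~batch` parameter are hypothetical and are not the implementation or signature of `Util.alloc`, whose internals are not shown in this README.

```ocaml
(* Hypothetical stand-in for a batch allocator; NOT the library's code.
   Atomically claims up to [batch] operations from [n_ops_to_do] and returns
   the number claimed, or 0 when no operations remain. *)
let rec alloc_batch ~batch n_ops_to_do =
  let remaining = Atomic.get n_ops_to_do in
  let n = min remaining batch in
  if n = 0 then 0
  else if Atomic.compare_and_set n_ops_to_do remaining (remaining - n) then n
  else
    (* Another domain claimed work in between, so retry. *)
    alloc_batch ~batch n_ops_to_do

let () =
  let n_ops_to_do = Atomic.make 2_500 in
  let rec drain total =
    match alloc_batch ~batch:1_000 n_ops_to_do with
    | 0 -> total
    | n -> drain (total + n)
  in
  Printf.printf "claimed %d operations in total\n" (drain 0)
```

Because each domain keeps claiming batches until the counter reaches zero, a domain that runs faster simply performs more of the operations, which keeps all domains busy for roughly the same time.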
We then collect all the suites into an association list. The association list has a name and entry point for each suite:
# let benchmarks = [
("Atomic", Bench_atomic.run_suite)
]
val benchmarks : (string * Suite.t) list = [("Atomic", <fun>)]
Usually the list of benchmarks is in the main module of the benchmark executable along with an invocation of Cmd.run:
# Cmd.run ~benchmarks ~argv:[||] ()
{
"results": [
{
"name": "Atomic",
"metrics": [
{
"name": "time per incr/1 worker",
"value": 11.791,
"units": "ns",
"trend": "lower-is-better",
"description": "Time to process one incr",
"#best": 9.250000000000002,
"#mean": 12.149960000000002,
"#median": 11.791,
"#sd": 1.851061543655424,
"#runs": 25
},
{
"name": "incrs over time/1 worker",
"value": 84.81044864727335,
"units": "M/s",
"trend": "higher-is-better",
"description": "Total number of incrs processed",
"#best": 108.1081081081081,
"#mean": 84.25129565093134,
"#median": 84.81044864727335,
"#sd": 12.911113376793846,
"#runs": 25
},
// ...
]
}
]
}
- : unit = ()
By default Cmd.run interprets command line arguments from Sys.argv. Unlike what one would typically do, we explicitly specify ~argv:[||], because this code is being run through the MDX tool.
Note that the output above is just a sample. The timings are non-deterministic and will vary slightly from one run of the benchmark to another, even on a single computer.
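In the sample output, the single-worker throughput in M/s appears to be the reciprocal of the time per operation in nanoseconds: 1000 / 11.791 is about 84.81, and 1000 / 9.25 is about 108.11 for the best run. The sketch below shows that conversion. The function name is hypothetical, and how the library combines measurements from multiple workers is not covered here.

```ocaml
(* Hypothetical helper: convert nanoseconds per operation into millions of
   operations per second for a single worker. One operation every [ns]
   nanoseconds is 1e9 /. ns operations per second, i.e. 1e3 /. ns M/s. *)
let mops_per_sec_of_ns_per_op ns = 1000.0 /. ns

let () =
  Printf.printf "%.2f M/s\n" (mops_per_sec_of_ns_per_op 11.791)
```

Note that the reported means do not convert this way (1000 / 12.14996 is about 82.30, not 84.25), which is what one would expect if each run is converted individually before averaging.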
Dependencies (7)
- domain_shims >= "0.1.0"
- yojson >= "2.1.0"
- mtime >= "2.0.0"
- multicore-magic >= "2.1.0"
- domain-local-await >= "1.0.1"
- ocaml >= "4.13.0"
- dune >= "3.8"
Used by
None
Conflicts
None